Content from Introduction to Networks


Last updated on 2025-02-10 | Edit this page

Estimated time: 12 minutes

Download Chapter notebook (ipynb)

Download Chapter PDF

Mandatory Lesson Feedback Survey

Overview

Questions

  • How are graphs represented?
  • How can we construct a network with NetworkX?
  • How can a network be quantified?

Objectives

  • Understanding the concept of a network graph
  • Creating network matrices and importing network data
  • Quantifying network properties
  • Visualising networks

What is a network?


A graph is a mathematical representation that describes the relationships between objects. A simple graph (such as the one pictured, below) typically contains two types of objects:

  • Nodes - also referred to as vertices, these are the entities of interest.

  • Edges - also referred to as links or arcs. These form the bridges between objects.

In the pictured ‘simple’ network, there are four nodes, connected by four edges. Notice that in this network, the edges represent simple connections, without any ‘flow’ of signal implied. Thus, this network is referred to as an undirected network.

Types of Network:

Besides representing the connections between nodes, networks can also contain other properties at both node and edge level. In the previous example, we saw that an edge can be represented as a simple connection, with undirected edges. Let’s take a closer look at a few other types of edge:

Edge Properties:

  • Undirected edges: these are simple connections, that merely link two or more nodes without an implied flow of signal. In biological data, a common example might be in protein–to-protein interaction (PPI) networks, for example. We will actually learn more about these, later in these lessons.

  • Directed edges: these are connections with a clear flow of signal implied. A common biological example is in metabolic or gene regulatory networks.

  • Weighted edges: these can be undirected or directed edges, and have a weight or quantitative value associated with them. As an example, these weights can conceptually represent properties such as the reliability of an interaction, the correlation between linked gene expressions, or even sequence similarity between two genes. Edges can also be weighted to reflect topological parameters, such as centrality scores and clustering.

  • Bipartite edges: these entail a more complex type of edge, connecting nodes from different groups in a network. Nodes of the same type are not at all connected in a network with bipartite edges. This edge type is often used to transform biological networks to study common components across nodes, and represent complex relationships across different types of data: for instance, gene-disease and enzyme-reaction associations in metabolic pathways.

Node Properties:

Similar to edge properties, nodes can also contain additional information other than the name of the entity. Below are a few examples of different properties nodes can have, that are commonly found in biological networks:

  • Attribute: This can be a quantitative or qualitative value, and can define the characteristic of the entity in question. For instance, in a complex gene regulatory network, nodes may have group labels showing whether or not a node might be a transcription factor regulator, or a target gene. Attributes such as these, can also contain other information; anything from time and colour, through to centrality scores (a topological parameter discussed in the next bullet point).

  • Topological parameters: a specific type of node attribute, topological parameters are of particular interest in this chapter, as they are mathematical properties that represent the spatial relations of nodes in a network, both globally (as in, the position of a node relative to the entire network) and locally (a node relative to other nodes to which it is directly connected). These properties are particularly important in interpreting the characteristic of the entity relative to the network, itself. Topological parameters will be discussed in more depth, further in this chapter.

Applications of Networks in the Life Sciences:


Networks are widely used in biology to represent complex relationships. We have previously referenced protein-protein interactions (PPI) as one such example, whereby nodes represent an individual protein, and the edges connected to it, represent its interactions with other proteins in the network.

Undirected Network Example:

Continuing in this vein, let’s look at a PPI network of Granzyme B (GZMB) in humans, extracted from the STRING-DB database:

In this diagram, we can see the PPI network centers around the GZMB protein in Homo sapiens, where the edges connect proteins to one another. Note that the edges here have no arrowheads on them, and thus represent an undirected network.

Granzyme B is secreted by immune cells, and mediates apoptosis (or programmed cell death). GZMB is often secreted in conjunction with another protein called perforin, which facilitates the formation of a pore in the target cell membrane. In order to trigger apoptosis, GZMB cleaves and activates initiator caspase enzymes - two such caspases are CASP3 and CASP9. Thus, the network clearly shows an interaction between GZMB and both CASP3 and CASP8. If you would like to learn more about the interactions in this network, you can find out more on STRING-DB on the following link.

Directed Network Example:

As an example of a directed network, let’s take a look at the signalling regulation in Philadelpha chromsome positive Chronic Myeloid Leukemia (Ph+CML), referenced from Chuang et al., 2015.

In this network, key proteins and pathways dysregulated in CML are represented in capsule-shaped and rectangular nodes, respectively. Unlike the earlier GZMB PPI network example, this network contains two examples of directed interactions: activating (green) and inhibitory (red). All interactions represent a directed flow of information, indicating both the regulator and effector of the interaction.

Ph+ CML is characterised by the abnormal translocation between chromosomes 9 and 22, resulting in the fusion gene of BCR-ABL1. The presence of BCR-ABL1 leads to the upregulation of pro-survival and proliferation pathways such as JAK/STAT signalling (as indicidated by BcrAbl activating Jak2, STAT3, and STAT5 in the network shown), whist bypassing cell cycle checkpoint inhibitors in the Ras/MAPK/ERK pathway. BRC-ABL fusion also inhibits apoptosis through direct and indirect interactions with pro- and anti-apoptotic proteins. The importnace of BCR-ABL in Ph+CML drives the development of tyrosine kinase inhibitors, including lmatinib, Dastinib, and Nilotinib, to diminish CML tumour cell growth in CML patients.

Bipartite Network Example:

Bipartite networks are usually used to study the association between nodes of the same type. An example of such an application is referenced in Yang et al., 2014.

In this diagram, we can see a portion of one of the networks constructed in this study. The initial IncRNA-disease association network is a bipartite network, inferring the IncRNAs associated to Alzheimer’s disease (AD) and breast cancer (BC). From the bipartite network, both CDKN2B-AS1 and BC200 are associated to both AD and BC. Based on this, a IncRNA-implicated disease network can be constructed, connecting AD and BC with a weighted edge of 2. Similarly, a disease-associated IncRNA network can be constructed connecting the IncRNA nodes CDKN2B-AS1 and BC200.

Further examples:


Disease spread mapping:

Networks can also be used to model how infection diseases spread. In a study published by Moon & Scoglio (2021), an example of a two-layer network model was used to estimate how contact tracing can impact the spread of COVID-19. The network model used comprises a mixture of both weighted and unweighted, undirected networks, with a staggering 445, 350 edges representing the individual households across Manhattan, NYC.

Inferring gene regulation from an omics dataset:

Another popular application of networks is the inference of gene regulatory crosstalk based on single omics or multiple omics data, or even single-cell data. This has been reviewed by Kim et al. (2023). Tehse gene regulatory networks (GRNs) are directed and contain two types of nodes: regulators and gene targets. Capturing the regulatory activity at multi-omics or single-cell levels, allows for a deeper understanding of the regulatory activity driving cellular identity and disease.

Practise Exercise 1

Match the following networks (top) to the term (bottom) that best describes the edge properties of that network.

Answer:

  1. ii
  2. iii
  3. i
  4. iv

Creating graphs with NetworkX:


NetworkX is a Python package for styding networks in Python. In this chapter, we will make use of NetworkX to create networks and study their properties.

Firstly, let’s begin by importing NetworkX:

PYTHON

import networkx as nx

To make networks intelligible to a computer, they can be represented as one of the following:

  • Adjacency matrix: this is where rows and columns are assigned to nodes in the network, and the presence of an edge is indicated by a numerical value.
  • Edge list: this is a list of the edges in a network. This list can be stored in a two-column matrix for directed or undirected networks. For weighted networks, a third column would be present, representing the edge weight.

Network 1

The adjacency matrix of this undirected and unweighted network is represented as a symmetric matrix containing solely 0 and 1 values. This represents the presence and absence of connections, respectively. The edge list of this undirected and unweighted network is represented by two columns of nodes.

Network 2

The adjacency matrix of this undirected and weighted network are represented by a symmetric matrix containing the weights of the edges. Similar to Network 1, when no connection is present, the edge is represented by a value of 0. The list of edges for this undirected and weighted network is represented by two columns of nodes, and a column of the edge weight.

Network 3

The adjacency matrix of this directed and unweighted network is represented as an unsymmetric matrix, containing values of either 0 or 1. The edge list of this undirected and weighted network is represented by two columns of nodes, specifying the source and target, showing the signal flow of the edge in question.

Creating a network using an adjacency matrix:


Next, we will cover how to create network with Python, using an adjacency matrix or edgelist.

In order to create a network from an adjacency matrix, we must firstly construct said adjacency matrix. Based on the Network 1 example given above, let’s create the adjacency matrix variable as a NumPy array.

PYTHON

import numpy as np

Network1_m = np.matrix([[0, 1, 1, 0, 0, 0, 0], #A
                        [1, 0, 1, 1, 0, 0, 0], #B
                        [1, 1, 0, 0, 0, 0, 0], #C
                        [0, 1, 0, 0, 0, 0, 0], #D
                        [0, 0, 0, 1, 0, 0, 0], #E
                        [0, 0, 0, 1, 0, 0, 0], #F
                        [0, 0, 0, 1, 0, 0, 0]])#G

Next, we must convert this adjacency matrix into a network. We can do so using the NetworkX method from_numpy_array, as follows:

PYTHON

Network1 = nx.from_numpy_array(Network1_m)

We can then view the edges of this network, using the edges() attribute; and the nodes using nodes(), as follows:

PYTHON

Network1.edges()

OUTPUT

EdgeView([(0, 1), (0, 2), (1, 2), (1, 3), (3, 4), (3, 5), (3, 6)])

PYTHON

Network1.nodes()

OUTPUT

NodeView((0, 1, 2, 3, 4, 5, 6))

Note that, when using the NumPy array, the names of the nodes (i.e. A, B, C etc.) are not stored in the network. In order to change the node names, we can use the relabel_nodes method together with a mapping dictionary, as follows:

PYTHON

mapping = {0: "A", 1: "B", 2: "C", 3: "D", 4: "E", 5: "F", 6: "G"}
Network1_new = nx.relabel_nodes(Network1, mapping)

Network1_new.nodes()

OUTPUT

NodeView(('A', 'B', 'C', 'D', 'E', 'F', 'G'))

In the case of larger networks with thousands of nodes, the corresponding adjacency matrices would also be very large. For instance, if a network has 1,000 nodes, the adjacency matrix would have a dimension of 1000 x 1000. Most of these nodes may not be connected, and thus the matrix itself could be sparse, and filled with values of 0. Reading this volume of data into memory would prove to be resource-heavy for your computer.

Creating a network using edge lists:


Contrastingly, edge lists are more succinct, and are legible even for larger networks. This is because an absent connection with a value of 0 is not stored as data, unlike in an adjacency matrix; making them computationally more efficient.

In the following cell, let’s create the directed network referenced above (Network 4), using an edge list. Firstly, we must create a Python list that will serve as our edge list, containing elements of paired tuples. We can do this using the function from_edgelist:

PYTHON

Network3_edgelist = [("B", "A"),
                     ("B", "C"),
                     ("C", "A"),
                     ("D", "B"),
                     ("E", "D"),
                     ("F", "D"),
                     ("G", "D")]

Network3 = nx.from_edgelist(Network3_edgelist,
                            create_using = nx.DiGraph()) ## Specify so that the network is directed.

Then, as before, we can observe the edges and nodes in the network:

PYTHON

Network3.edges()

OUTPUT

OutEdgeView([('B', 'A'), ('B', 'C'), ('C', 'A'), ('D', 'B'), ('E', 'D'), ('F', 'D'), ('G', 'D')])

PYTHON

Network3.nodes()

OUTPUT

NodeView(('B', 'A', 'C', 'D', 'E', 'F', 'G'))

Notice that by using the from_edgelist() function, we do not have to assign the names again to the network, using a dictionary.

Reading a Pandas DataFrame into a network:


It is also possible to read in an edge list that has been previously stored as a Pandas DataFrame. Let’s use Network 2 as an example from the previous figure at the start of the lesson. The edge list for Network 2 is stored as a .csv file in your repository’s data folder, and can be read into a Pandas DataFrame using the from_pandas_edgelist function:

PYTHON

from pandas import read_csv

Network2_edgelist = read_csv("./Data/Network2.csv")

Network2_edgelist.head()

OUTPUT

  Node1 Node2  Weight
0     A     B       1
1     A     C       2
2     B     C       3
3     B     D       1
4     D     E       1

PYTHON

Network2 = nx.from_pandas_edgelist(Network2_edgelist,
                                   source = "Node1",
                                   target = "Node2",
                                   edge_attr = True)

As before, we can access the network’s edge and node attributes:

PYTHON

Network2.edges()

OUTPUT

EdgeView([('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'D'), ('D', 'E'), ('D', 'F'), ('D', 'G')])

PYTHON

Network2.nodes()

OUTPUT

NodeView(('A', 'B', 'C', 'D', 'E', 'F', 'G'))

The edge weight information can then be accessed using the get_edge_attributes method:

PYTHON

nx.get_edge_attributes(Network2,"Weight")

OUTPUT

{('A', 'B'): 1, ('A', 'C'): 2, ('B', 'C'): 3, ('B', 'D'): 1, ('D', 'E'): 1, ('D', 'F'): 3, ('D', 'G'): 2}

Practise Exercise 2

Create a NetworkX graph object based on the adjacency matrix, from the bipartite ‘Network 4’ example.

PYTHON

Network4_m = np.matrix([[0, 1, 0, 0, 0, 0, 0], #A
                        [1, 0, 1, 0, 0, 1, 1], #B
                        [0, 1, 0, 1, 0, 0, 0], #C
                        [0, 0, 1, 0, 1, 1, 1], #D
                        [0, 0, 0, 1, 0, 0, 0], #E
                        [0, 1, 0, 1, 0, 0, 0], #F
                        [0, 1, 0, 1, 0, 0, 0]])#G

PYTHON

Network4 = nx.from_numpy_array(Network4_m)

Quantifying network properties:


In order to study a network, we can measure how its nodes are positioned using centrality scores. A centrality score is a numerical measure of how important a node is, its influence and/or its structural position within a given network.

Degree distribution:

In an undirected network, the term degree refers to the number of edges connected to any given node. For a directed network, a degree is subdivided to an in-degree and out-degree:

  • In-degree refers to the number of incoming edges to a node.

  • Out-degree refers to the number of outgoing edges from a node.

Although the degree can be calculated rather simply by counting the number of edges connected to a given node, this may be difficult for networks that are large. To illustrate this, we will incorporate the example of the Breast Cancer Network graph, at this point in the lesson.

Breast Cancer Network:

This is a network generated from a list of 202 mutated genes in breast cancer, identified by Repana et al., 2019. The resulting network was constructed from the STRING protein-protein interaction database, and its 3,696 edge weights were calculated from the evidence-based strength of interactions between its nodes. The next lesson will serve to reveal more information about constructing PPI networks based on lists of genes.

The Breast Cancer Network is given in the data folder of your repository, and is stored as a .csv file. Let’s commence by reading this into a Pandas DataFrame, as before:

PYTHON

bcn_df = read_csv("./Data/BreastCancerNetwork.csv")

bcn_df.head()

OUTPUT

   Node1   Node2  Weight
0  ACTG1  ACTL6B   0.914
1  ACTG1    CDH1   0.457
2  ACTG1  CTNNA1   0.953
3  ACTG1  ARID1A   0.841
4  ACTG1   ARID2   0.816

Let’s then convert it into a NetworkX graph object:

PYTHON

BC_g = nx.from_pandas_edgelist(bcn_df,
                               source = "Node1",
                               target = "Node2",
                               edge_attr = True)

We can calculate the degree centrality of each node using the NetworkX degree() function”

PYTHON

degree = nx.degree(BC_g)

print(degree)

OUTPUT

[('ACTG1', 15), ('ACTL6B', 14), ('CDH1', 78), ('CTNNA1', 13), ('ARID1A', 89), ('ARID2', 41), ('SMARCA4', 74), ('AFDN', 15), ('FLNA', 22), ('AKT2', 40), ('SMARCD1', 18), ('PBRM1', 49), ('RHOA', 53), ('AKT1', 114), ('MYC', 126), ('ARID1B', 40), ('NTRK3', 22), ('RUNX1', 74), ('CBFB', 39), ('CCND1', 95), ('DNMT3A', 63), ('H3C12', 85), ('RB1', 40), ('ACVR1B', 8), ('KRAS', 115), ('PDGFRA', 54), ('SMAD3', 54), ('SMAD4', 90), ('TGFBR2', 43), ('BRCA2', 77), ('ALK', 54), ('CTNNB1', 124), ('EGFR', 100), ('KIT', 65), ('GRIA2', 8), ('USP9X', 18), ('MEN1', 49), ('EPHA7', 7), ('KMT2A', 50), ('NRAS', 92), ('HRAS', 82), ('AFF2', 3), ('CBL', 35), ('ATRX', 68), ('AGTR2', 1), ('XBP1', 14), ('AURKA', 40), ('CDKN1B', 56), ('MLH1', 58), ('MSH2', 47), ('FOXA1', 42), ('RARA', 45), ('NOTCH2', 46), ('APC', 19), ('MDM2', 77), ('PIK3C3', 21), ('KMT2C', 66), ('AXIN1', 40), ('CREBBP', 74), ('CCNE1', 52), ('EP300', 87), ('NOTCH3', 29), ('PIK3CA', 105), ('CDK6', 58), ('PMS2', 31), ('DAXX', 35), ('ERBB3', 56), ('NCOR1', 50), ('TP53', 147), ('ERBB2', 86), ('CDKN2B', 49), ('ATM', 99), ('PREX2', 7), ('TSC1', 23), ('KMT2D', 62), ('RPTOR', 30), ('MET', 44), ('STK11', 54), ('DNMT3B', 32), ('TBK1', 16), ('GRIN2A', 5), ('ZFP36L1', 2), ('FOXO3', 56), ('MYB', 22), ('ERBB4', 46), ('ATR', 22), ('NF2', 46), ('GATA2', 45), ('RET', 54), ('NF1', 94), ('CASP8', 46), ('NEK2', 12), ('CTSS', 6), ('MCL1', 50), ('PRDM1', 16), ('GNAS', 47), ('PTEN', 120), ('NCOA3', 38), ('IRS4', 9), ('CCND3', 29), ('ASXL1', 42), ('PIK3CD', 33), ('KDM6A', 60), ('WT1', 56), ('GATA3', 57), ('CHEK2', 48), ('MAP3K1', 34), ('SETD2', 56), ('MITF', 20), ('FGFR1', 51), ('ESR1', 101), ('MAP2K4', 27), ('MST1', 4), ('BAP1', 21), ('CDKN2A', 108), ('BRCA1', 102), ('BRAF', 92), ('PIK3R1', 72), ('FBXO32', 5), ('NTRK1', 39), ('DICER1', 43), ('TET2', 44), ('FAS', 14), ('ITCH', 25), ('IGF1R', 62), ('FBXW7', 83), ('ARHGAP35', 11), ('PTPN11', 62), ('NOTCH1', 89), ('AKT3', 32), ('PIK3CB', 37), ('TP63', 24), ('ROS1', 29), ('PTPRD', 16), ('ETV6', 46), ('BUB1B', 25), ('TBL1XR1', 24), ('STAG2', 44), ('PALB2', 32), ('ZNF217', 19), ('PCLO', 6), ('SPEN', 14), ('SYNE1', 13), ('NCOA2', 23), ('MED12', 40), ('FGFR2', 51), ('CHD8', 24), ('CTCF', 41), ('CASZ1', 1), ('PHF6', 29), ('SF3B1', 14), ('CUX1', 19), ('ATAD2', 11), ('BRIP1', 13), ('ZMYM3', 11), ('H2BC5', 45), ('SPOP', 24), ('ERCC4', 13), ('CCNI', 4), ('TNFAIP3', 18), ('BCOR', 6), ('DIS3', 7), ('USH2A', 9), ('PAX2', 13), ('TRAF5', 4), ('MAP3K13', 3), ('IRAK4', 3), ('ERG', 6), ('FOXP1', 15), ('EPHB1', 6), ('PTPN22', 6), ('CBLB', 14), ('DDR2', 17), ('KLF6', 10), ('CD3EAP', 4), ('FRMPD2', 1), ('MED23', 8), ('DCAF4L2', 1), ('TBX3', 12), ('CIC', 3), ('CNOT3', 5), ('HRNR', 2), ('RBMX', 6), ('HLA-DRB1', 6), ('GPS2', 8), ('DNAH11', 3), ('RPGR', 3), ('LGALS1', 3), ('EXOC2', 5), ('LRIG2', 1), ('EPHA6', 2), ('EXT2', 1), ('FH', 4), ('SHQ1', 1), ('GPR32', 1), ('PTGFR', 1), ('GPRIN2', 2), ('RYR2', 5), ('PPM1L', 1), ('OR2D2', 1), ('OR6A2', 1), ('PARP4', 1)]

The degree distribution can then be plotted using NetworkX and matplotlib to give a histogram:

PYTHON

from matplotlib.pyplot import subplots, show

degs = dict(BC_g.degree()).values()

fig, ax = subplots()

ax.hist(degs, bins=20);

ax.set_title("Degree Distribution", fontsize=20)
ax.set_ylabel("Count", fontsize=16)
ax.set_xlabel("Degree", fontsize=16);


show()

Degree centrality:


Degree centrality is the degree of each node normalised by the maximum possible degree that exists within the network. It can be calculated easily using the NetworkX function degree_centrality():

PYTHON

degree_centrality = nx.degree_centrality(BC_g)

print(degree_centrality)

OUTPUT

{'ACTG1': 0.07462686567164178, 'ACTL6B': 0.06965174129353234, 'CDH1': 0.3880597014925373, 'CTNNA1': 0.06467661691542288, 'ARID1A': 0.4427860696517413, 'ARID2': 0.20398009950248755, 'SMARCA4': 0.3681592039800995, 'AFDN': 0.07462686567164178, 'FLNA': 0.10945273631840796, 'AKT2': 0.1990049751243781, 'SMARCD1': 0.08955223880597014, 'PBRM1': 0.24378109452736318, 'RHOA': 0.263681592039801, 'AKT1': 0.5671641791044776, 'MYC': 0.6268656716417911, 'ARID1B': 0.1990049751243781, 'NTRK3': 0.10945273631840796, 'RUNX1': 0.3681592039800995, 'CBFB': 0.19402985074626866, 'CCND1': 0.472636815920398, 'DNMT3A': 0.31343283582089554, 'H3C12': 0.4228855721393035, 'RB1': 0.1990049751243781, 'ACVR1B': 0.03980099502487562, 'KRAS': 0.572139303482587, 'PDGFRA': 0.26865671641791045, 'SMAD3': 0.26865671641791045, 'SMAD4': 0.44776119402985076, 'TGFBR2': 0.21393034825870647, 'BRCA2': 0.38308457711442784, 'ALK': 0.26865671641791045, 'CTNNB1': 0.6169154228855721, 'EGFR': 0.4975124378109453, 'KIT': 0.3233830845771144, 'GRIA2': 0.03980099502487562, 'USP9X': 0.08955223880597014, 'MEN1': 0.24378109452736318, 'EPHA7': 0.03482587064676617, 'KMT2A': 0.24875621890547264, 'NRAS': 0.4577114427860696, 'HRAS': 0.4079601990049751, 'AFF2': 0.014925373134328358, 'CBL': 0.17412935323383083, 'ATRX': 0.3383084577114428, 'AGTR2': 0.004975124378109453, 'XBP1': 0.06965174129353234, 'AURKA': 0.1990049751243781, 'CDKN1B': 0.27860696517412936, 'MLH1': 0.2885572139303483, 'MSH2': 0.23383084577114427, 'FOXA1': 0.208955223880597, 'RARA': 0.22388059701492538, 'NOTCH2': 0.2288557213930348, 'APC': 0.0945273631840796, 'MDM2': 0.38308457711442784, 'PIK3C3': 0.1044776119402985, 'KMT2C': 0.3283582089552239, 'AXIN1': 0.1990049751243781, 'CREBBP': 0.3681592039800995, 'CCNE1': 0.25870646766169153, 'EP300': 0.43283582089552236, 'NOTCH3': 0.14427860696517414, 'PIK3CA': 0.5223880597014925, 'CDK6': 0.2885572139303483, 'PMS2': 0.15422885572139303, 'DAXX': 0.17412935323383083, 'ERBB3': 0.27860696517412936, 'NCOR1': 0.24875621890547264, 'TP53': 0.7313432835820896, 'ERBB2': 0.42786069651741293, 'CDKN2B': 0.24378109452736318, 'ATM': 0.4925373134328358, 'PREX2': 0.03482587064676617, 'TSC1': 0.1144278606965174, 'KMT2D': 0.30845771144278605, 'RPTOR': 0.14925373134328357, 'MET': 0.21890547263681592, 'STK11': 0.26865671641791045, 'DNMT3B': 0.15920398009950248, 'TBK1': 0.07960199004975124, 'GRIN2A': 0.02487562189054726, 'ZFP36L1': 0.009950248756218905, 'FOXO3': 0.27860696517412936, 'MYB': 0.10945273631840796, 'ERBB4': 0.2288557213930348, 'ATR': 0.10945273631840796, 'NF2': 0.2288557213930348, 'GATA2': 0.22388059701492538, 'RET': 0.26865671641791045, 'NF1': 0.46766169154228854, 'CASP8': 0.2288557213930348, 'NEK2': 0.05970149253731343, 'CTSS': 0.029850746268656716, 'MCL1': 0.24875621890547264, 'PRDM1': 0.07960199004975124, 'GNAS': 0.23383084577114427, 'PTEN': 0.5970149253731343, 'NCOA3': 0.1890547263681592, 'IRS4': 0.04477611940298507, 'CCND3': 0.14427860696517414, 'ASXL1': 0.208955223880597, 'PIK3CD': 0.16417910447761194, 'KDM6A': 0.29850746268656714, 'WT1': 0.27860696517412936, 'GATA3': 0.2835820895522388, 'CHEK2': 0.23880597014925373, 'MAP3K1': 0.1691542288557214, 'SETD2': 0.27860696517412936, 'MITF': 0.09950248756218905, 'FGFR1': 0.2537313432835821, 'ESR1': 0.5024875621890548, 'MAP2K4': 0.13432835820895522, 'MST1': 0.01990049751243781, 'BAP1': 0.1044776119402985, 'CDKN2A': 0.5373134328358209, 'BRCA1': 0.5074626865671642, 'BRAF': 0.4577114427860696, 'PIK3R1': 0.3582089552238806, 'FBXO32': 0.02487562189054726, 'NTRK1': 0.19402985074626866, 'DICER1': 0.21393034825870647, 'TET2': 0.21890547263681592, 'FAS': 0.06965174129353234, 'ITCH': 0.12437810945273632, 'IGF1R': 0.30845771144278605, 'FBXW7': 0.4129353233830846, 'ARHGAP35': 0.05472636815920398, 'PTPN11': 0.30845771144278605, 'NOTCH1': 0.4427860696517413, 'AKT3': 0.15920398009950248, 'PIK3CB': 0.18407960199004975, 'TP63': 0.11940298507462686, 'ROS1': 0.14427860696517414, 'PTPRD': 0.07960199004975124, 'ETV6': 0.2288557213930348, 'BUB1B': 0.12437810945273632, 'TBL1XR1': 0.11940298507462686, 'STAG2': 0.21890547263681592, 'PALB2': 0.15920398009950248, 'ZNF217': 0.0945273631840796, 'PCLO': 0.029850746268656716, 'SPEN': 0.06965174129353234, 'SYNE1': 0.06467661691542288, 'NCOA2': 0.1144278606965174, 'MED12': 0.1990049751243781, 'FGFR2': 0.2537313432835821, 'CHD8': 0.11940298507462686, 'CTCF': 0.20398009950248755, 'CASZ1': 0.004975124378109453, 'PHF6': 0.14427860696517414, 'SF3B1': 0.06965174129353234, 'CUX1': 0.0945273631840796, 'ATAD2': 0.05472636815920398, 'BRIP1': 0.06467661691542288, 'ZMYM3': 0.05472636815920398, 'H2BC5': 0.22388059701492538, 'SPOP': 0.11940298507462686, 'ERCC4': 0.06467661691542288, 'CCNI': 0.01990049751243781, 'TNFAIP3': 0.08955223880597014, 'BCOR': 0.029850746268656716, 'DIS3': 0.03482587064676617, 'USH2A': 0.04477611940298507, 'PAX2': 0.06467661691542288, 'TRAF5': 0.01990049751243781, 'MAP3K13': 0.014925373134328358, 'IRAK4': 0.014925373134328358, 'ERG': 0.029850746268656716, 'FOXP1': 0.07462686567164178, 'EPHB1': 0.029850746268656716, 'PTPN22': 0.029850746268656716, 'CBLB': 0.06965174129353234, 'DDR2': 0.0845771144278607, 'KLF6': 0.04975124378109452, 'CD3EAP': 0.01990049751243781, 'FRMPD2': 0.004975124378109453, 'MED23': 0.03980099502487562, 'DCAF4L2': 0.004975124378109453, 'TBX3': 0.05970149253731343, 'CIC': 0.014925373134328358, 'CNOT3': 0.02487562189054726, 'HRNR': 0.009950248756218905, 'RBMX': 0.029850746268656716, 'HLA-DRB1': 0.029850746268656716, 'GPS2': 0.03980099502487562, 'DNAH11': 0.014925373134328358, 'RPGR': 0.014925373134328358, 'LGALS1': 0.014925373134328358, 'EXOC2': 0.02487562189054726, 'LRIG2': 0.004975124378109453, 'EPHA6': 0.009950248756218905, 'EXT2': 0.004975124378109453, 'FH': 0.01990049751243781, 'SHQ1': 0.004975124378109453, 'GPR32': 0.004975124378109453, 'PTGFR': 0.004975124378109453, 'GPRIN2': 0.009950248756218905, 'RYR2': 0.02487562189054726, 'PPM1L': 0.004975124378109453, 'OR2D2': 0.004975124378109453, 'OR6A2': 0.004975124378109453, 'PARP4': 0.004975124378109453}

As is evident from the output of the previous cell, the resulting degree centrality is given as a Python dictionary. It is then possible to convert this dictionary into a Pandas DataFrame using the DataFrame.from_dict function:

PYTHON

import pandas as pd

dc_df = pd.DataFrame.from_dict(degree_centrality, orient='index')
dc_df.head()

OUTPUT

               0
ACTG1   0.074627
ACTL6B  0.069652
CDH1    0.388060
CTNNA1  0.064677
ARID1A  0.442786

The resulting network can also then be ranked according the centrality score, as follows:

PYTHON

print(dc_df.sort_values(by=[0]).head())

OUTPUT

                0
PARP4    0.004975
CASZ1    0.004975
FRMPD2   0.004975
DCAF4L2  0.004975
OR6A2    0.004975

Closeness centrality:


Closeness centrality is another topological measure that is commonly used in analysing a biological network. It measures how close a node is to all other nodes in a graph, based on the amount of flow information that can be measured crossing through a node (i.e. the shortest path). Closeness centrality is particularly useful in identifying nodes that can spread information efficiently throughout a network. In a biological context, an example might be a gene that is in the centre of a network, which is highly connected to other genes, and therefore influential to the connectivity measured across the entire network.

In NetworkX, it is possible to calculate closeness centrality using the function: closeness_centrality(). The results of this calculation are presented, once again, in dictionary format; which w can convert into a DataFrame, as before:

PYTHON

closeness_centrality = nx.closeness_centrality(BC_g)
cc_df = pd.DataFrame.from_dict(closeness_centrality, orient='index')
cc_df.head()

OUTPUT

               0
ACTG1   0.469095
ACTL6B  0.447773
CDH1    0.597030
CTNNA1  0.475893
ARID1A  0.625460

Personalised PageRank:


PageRank is a ranking system developed by Google’s founder Larry Page in collaboration with Sergey Brin, and is used to identify the most central or interesting node in a network; conceptually, it is relatively similar to closeness centrality. However, PageRank allso accounts for whether the nodes are connected to another important node in the network.

The London Underground Analogy:

Envision that you are newly settling into London after finding a suitable place to live, close to your workplace. You are exploring London by travelling, at random, across the city using the London Underground (Tube). You can initiate your travel from any station in the Tube network and, while travelling, you decide - by means of a dice toss - which of the neighbouring stations you would like to visit, next.

Suppose you begin this following your arrival at Angel Station. You then toss the dice to decide whether to next visit King’s Cross St. Pancras, or Old Street Station, with an equal chance of visiting each station. Your dice throw points to Old Street Station, and you proceed, accordingly. Once you arrive at Old Street, you then toss the dice again to decide whether or not to travel back to Angel, or proceed to Moorgate. In order to prevent the scenario where you get trapped in one part of the Tube network, an occasional random jump or teleportation to any station on the network, is introduced. This allows you to travel across the entire London Underground network. This process then repeats infinitely, leading to the computation of probabilities for being at a specific station, and this is represented as a PageRank score.

PageRank results in a probability independent of the starting station. However, if you work at a particular station such as Angel, you may wish to live near to a station that is closely connected to Angel, and not too far away from it. In order to address this problem, we can then assign Angel a non-zero value, with other stations set to 0; this means that the PageRank algorithm is forcing you back to Angel, every time a random jump occurs. This process then repeats infinitely, and the probability for each station that you take to Angel, is calculated as a Personalised PageRank (PPR) score.

The Breast Cancer Network:


In the context of a biological network, we may be interested in nodes that are highly connected to a particular gene, protein or transcription factor, for example. Using the Breast Cancer Network as an example to illustrate this, let’s explore nodes within this network that are highly connected to EGFR using the aforementioned PPR score. EGFR is an Epidermal Growth Factor Receptor - a transmembrane receptor which regulates cell growth and proliferation, in which mutations can contribute to the progression of tumours in cancers such as breast cancer.

Fistly, let us observe how we can address the teleportation customisation, using a Python dictionary:

PYTHON

EGFR_dict = dict()

for i in BC_g.nodes():
    if i == "EGFR":
        EGFR_dict[i] = 1 ## Only assigning 1 to EGFR

    else:
        EGFR_dict[i] = 0 ## Assigning 0 to all other nodes in the network

PPR can then be calculated using the pagerank function in NetworkX. As the results are returned in a dictionary, we can use Pandas functions to convert these into a Pandas DataFrame:

PYTHON

EGFR_ppr = nx.pagerank(BC_g, personalization = EGFR_dict)
EGFR_ppr_df = pd.DataFrame.from_dict(EGFR_ppr, orient = "index")
print(EGFR_ppr_df.head())

OUTPUT

               0
ACTG1   0.001526
ACTL6B  0.001219
CDH1    0.009330
CTNNA1  0.002672
ARID1A  0.009530

The results are arranged alphabetically. Let’s have a look at the top 10 nodes from the EGFR PPR analysis:

PYTHON

import matplotlib.pyplot as plt
EGFR_PPR_top10 = EGFR_ppr_df.sort_values(by=[0], ascending=False).head(n = 10)

plt.barh(EGFR_PPR_top10.index,
        EGFR_PPR_top10[0],
        color="blue")

plt.xlabel("Score")
plt.ylabel("Node")
plt.title("EGFR PPR")
plt.legend()
plt.show()

EGFR is ranked highly, as we have prioritised it in the personalisation. In biological context, we know that EGFR, and other proteins such as p53 and PTEN, are often cooperative in the initiation of tumours and their progression. Having them rank as highly as they do, is consistent with this information.

Practise Exercise 3

Calculate the degree centrality of Network 1, and store the result in a Pandas DataFrame:

PYTHON

Network1_DC = nx.degree_centrality(Network1_new)

N1_DC_df = pd.DataFrame.from_dict(Network1_DC, orient='index')

print(N1_DC_df.head(10))

OUTPUT

          0
A  0.333333
B  0.500000
C  0.333333
D  0.666667
E  0.166667
F  0.166667
G  0.166667

Visualising graphs:


Previously, we saw that Python could be used to readily extract information about a given network. We can move on to visualise these networks using the draw function from the NetworkX library.

We can start by visualising the more simple Network 3 example, from the previously referenced example figure:

PYTHON

nx.draw(Network2,
        with_labels = True)
show()

Using NetworkX, we can also dictate the way the nodes are positioned, by constraining and specifying the layout of a network. There is an extensive variety of options for doing this, but we will use a layout known as the ‘spiral’ layout, for our first example. For a list of layout options, refer to the NetworkX documentation.

PYTHON

Network3Layout = nx.spiral_layout(Network2)

nx.draw(Network2, 
        Network3Layout,
        with_labels = True)

show()

Another popular network layout that is frequently used is known as the spring_layout. Lets visualise this layout using the directed network example from our figure (Network 3):

PYTHON

Network3Layout = nx.spring_layout(Network3)

nx.draw(Network3, 
        Network3Layout,
        node_color='pink',
        with_labels = True)

show()

Earlier in this lesson we generated a bar chart of the EGFR top 10 PPR nodes within the Breast Cancer Network. We can now demonstrate how to overlay this bar chart with the subgraph of the top 10 PPR nodes, using the NetworkX function nx.draw(), as follows:

PYTHON

from matplotlib.pyplot import axes, axis, title
import collections
import matplotlib.pyplot as plt

## Generating the dictionary for EGFR PPR
EGFR_dict = dict()
for i in BC_g.nodes():
    if i == "EGFR":
        EGFR_dict[i] = 1 ## Only assigning 1 to EGFR
    else:
        EGFR_dict[i] = 0 ## Assigning 0 to all other nodes in the network

## Calculate PPR from EGFR
EGFR_ppr = nx.pagerank(BC_g,
                        personalization = EGFR_dict)
EGFR_ppr_df = pd.DataFrame.from_dict(EGFR_ppr, orient = "index")

## Getting the top 10 nodes for PageRank
EGFR_PPR_top10 = EGFR_ppr_df.sort_values(by=[0], ascending=False).head(n = 10)

## Creating subplot
fig, ax = subplots()

title("EGFR PPR")

## Generating the horizontal barplot
ax.barh(EGFR_PPR_top10.index,
        EGFR_PPR_top10[0],
        color="blue")


ax.set_xlabel("Score")
ax.set_ylabel("Node")

# draw graph in inset
axes([0.4, 0.4, 0.5, 0.5])

## Extract EGFR top 10 nodes to list
EGFR_PPR_nodes = EGFR_PPR_top10.index.tolist()

BC_EGFR_subgraph = BC_g.subgraph(EGFR_PPR_nodes)
pos = nx.spring_layout(BC_EGFR_subgraph, seed=2)

axis("off")

OUTPUT

(np.float64(0.0), np.float64(1.0), np.float64(0.0), np.float64(1.0))

PYTHON

# this code draws the network node
nx.draw_networkx_nodes(BC_EGFR_subgraph, 
                       pos, 
                       node_size=30, 
                       node_color='r',
                       )
## this code shows the label
nx.draw_networkx_labels(BC_EGFR_subgraph, 
                       pos,
                       {lab:lab for n,lab in enumerate(EGFR_PPR_nodes) if lab in pos},
                       font_size=10,
                       font_color='black')

OUTPUT

{'EGFR': Text(-0.024977112740051644, 0.2477515392662663, 'EGFR'), 'TP53': Text(0.9052864555655546, 0.2197476144101963, 'TP53'), 'MYC': Text(0.5187832997364015, 0.8071792868154481, 'MYC'), 'CTNNB1': Text(0.2390001211855295, -0.4063925724901173, 'CTNNB1'), 'PTEN': Text(0.02694355925654746, -1.0, 'PTEN'), 'AKT1': Text(-0.7663454858284434, 0.49842525992514325, 'AKT1'), 'KRAS': Text(-0.591657474895241, -0.6651621403025627, 'KRAS'), 'PIK3CA': Text(-0.9440882738679962, -0.13432735139928575, 'PIK3CA'), 'CDKN2A': Text(0.8542989972359399, -0.5226701443361513, 'CDKN2A'), 'BRCA1': Text(-0.21724408564824113, 0.9554485081110603, 'BRCA1')}

PYTHON

# this code draws the network edge
nx.draw_networkx_edges(BC_EGFR_subgraph, pos);

plt.show()

It is also possible to customise the network nodes and edges. For instance, it is easily possible to manipulate colour using one of a keyword argument pertaining to one of matplotlib’s many included colours, and also to change both the shape and size of the displayed nodes. Note that the plot will vary each time it is run, due to the layout algorithm selected.

PYTHON

Network1Layout = nx.random_layout(Network1)

newLabels = {
    0: 'A',
    1: 'B',
    2: 'C',
    3: 'D',
    4: 'E',
    5: 'F',
    6: 'G'
}

nx.draw(Network1, 
        Network1Layout,
        labels=newLabels,
        node_color='gold',
        node_shape="H",
        node_size=800)
show()

Should you want to add another note, and have it connected to only one of the existing nodes, for example, you can add one, as follows, using the .add_node method available in NetworkX.

PYTHON

Network1.add_node(7)

print(Network1.nodes)

OUTPUT

[0, 1, 2, 3, 4, 5, 6, 7]

We can make further modifications to the network graph. For instance - say we want to set a new layout for this graph, and update its labels to call the new node ‘H’; and specify to which nodes it connects. We can do so, as follows:

PYTHON

newLayoutNetwork1 = nx.random_layout(Network1)
newLabels = {
    0: 'A',
    1: 'B',
    2: 'C',
    3: 'D',
    4: 'E',
    5: 'F',
    6: 'G',
    7: 'H'
}


nx.draw(Network1, 
        newLayoutNetwork1,
        labels=newLabels,
        node_color='gold',
        node_shape="H",
        node_size=800);

show()

For nodes with high centrality scores, it may also be interesting to find out where in the network they are positioned. As referenced earlier in this lesson, we can make use of both the centrality results we obtained before, together with a function of our own, that makes use of several of the aforementioned NetworkX functions and methods. Into this function, we will build in code that allows us to colour the nodes based on the centrality score. Let’s try this using the Network 4 example.

We must firstly define this function. In this example, we will call the function draw_centarlity:

PYTHON

import matplotlib.colors as mcolors
def draw_centrality(G, pos, measures, measure_name):
    
    nodes = nx.draw_networkx_nodes(G, pos, node_size=250, cmap=plt.cm.plasma, 
                                   node_color=list(measures.values()),
                                   nodelist=measures.keys())
    nodes.set_norm(mcolors.SymLogNorm(linthresh=0.01, linscale=1, base=10))
    labels = nx.draw_networkx_labels(G, pos)
    edges = nx.draw_networkx_edges(G, pos)

    plt.title(measure_name)
    plt.colorbar(nodes)
    plt.axis('off')
    plt.show()

PYTHON

pos = nx.spring_layout(Network4)
draw_centrality(Network4, 
        pos, 
        nx.closeness_centrality(Network4), 
        'Closeness Centrality')

Visualising PPR:

We previously explored how PPR results can be viewed using a bar chart. However, from this, it is unclear as to how the ‘walk’ was performed. Let’s use a block of code here that allows us to visualise the PPR of EGFR from its ten closest, neighbouring nodes:

PYTHON

from matplotlib import animation
from IPython.display import HTML
import random

H = BC_g
top10 = EGFR_ppr_df.sort_values(by=[0], ascending=False).head(n = 10).index
G = H.subgraph(top10)
pos = nx.spectral_layout(G, dim=3)
nodes = np.array([pos[v] for v in G.nodes()])
edges = np.array([(pos[u], pos[v]) for u, v in G.edges()])

def _frame_update(index, target_node):
    ax.clear()
    ax.scatter(*nodes.T, alpha=0.2, s=100, color="blue")
    for key in pos:
        ax.text(pos[key][0],
                pos[key][1],
                pos[key][2],
                s=key)
    for vizedge in edges:
        ax.plot(*vizedge.T, color="gray")
    neighbors = list(G.neighbors(target_node))
    if index % 5 == 0:
        node[0] = random.choice(neighbors)
    node0 = pos[node[0]]
    ax.scatter(*node0, alpha=1, marker="s", color="red", s=100)
    ax.view_init(index * 0.2, index * 0.5)
    ax.grid(False)
    ax.set_axis_off()

    # Set axis limits to zoom in on the network
    ax.set_xlim3d(-0.8, 0.3)
    ax.set_ylim3d(-0.5, 0.5)
    ax.set_zlim3d(-0.5, 0.5)
    
    return

fig = plt.figure()
ax = fig.add_subplot(111, 
                     projection="3d")
ax.grid(False)
ax.set_axis_off()
node = [0]

ani = animation.FuncAnimation(fig, 
                              _frame_update, 
                              interval=20, 
                              frames=range(100), 
                              cache_frame_data=False,
                              fargs=("EGFR",), 
                              repeat=True)

HTML(ani.to_jshtml())

Practise Exercise 4

  1. Use the graph layout alrogrithm called shell_layout to plot Network 2 from our example figure.

  2. Draw the graph with a third layout algorithm called random_layout, and draw the graph several times by repeatedly executing the code.

  3. Draw the graph from the matrix n, created in Practise Exercise 1.1, with the layout algorithm spectral_layout, givint the nodes the names V, X, Y and Z.

PYTHON

# Practise Exercise 4, Question 1:
newLayout = nx.shell_layout(Network2)

nx.draw(Network2, 
        newLayout,
        with_labels = True)

show()

PYTHON

# Practise Exercise 4, Question 2:
newLayout = nx.random_layout(Network2)

nx.draw(Network2, 
        newLayout)

show()

PYTHON

# Practise Exercise 4, Question 3:
from numpy import zeros

n = zeros((4,4))

n[0, 1] = 1
n[1, 2] = 1
n[1, 3] = 1

new = nx.from_numpy_array(n)

newLabels = {
    0: 'V',
    1: 'X',
    2: 'Y',
    3: 'Z',
}

newLayout = nx.spectral_layout(new)

nx.draw(new, newLayout,
        labels=newLabels)

show()

Optional Materials:


Clustering Coefficient:

The clustering coefficient (CC) is a measure of the degree to which nodes in a graph tend to cluster together. Clustering coefficients can range from 0 to 1, where:

  • 0 - none of the node’s neighbours are connected.

  • 1 - All of the node’s neighbours are connected, forming a complete clique.

Similarly, the average CC also indicates how well the nodes in the graph are connected. As with the aforementioned centrality calculations, this can be calculated in a straightforward way for nodes in a graph. The following is a simple block of code that demonstrates how you can calculate the CC using the Breast Cancer Network example:

PYTHON

clustering_coefficients = nx.clustering(BC_g)

# Calculate the average clustering coefficient for the entire graph
average_clustering_coefficient = nx.average_clustering(BC_g)
print(f"Average Clustering Coefficient: {average_clustering_coefficient}")

OUTPUT

Average Clustering Coefficient: 0.5906626376400339

Edge connectivity:

Edge connectivity measures the minimum number of edges that need to be removed in order to disconnect a network graph.

For example, the Breast Cancer Network is composed of two disconnected subgraphs. Let’s calculate the edge connectivity for each of the subgraph components in this network:

PYTHON

# Calculate edge connectivity
for i, c in enumerate(nx.connected_components(BC_g)): 
    edge_connectivity = nx.edge_connectivity(BC_g.subgraph(c))
    print(f"Subgraph size for {i} component: {len(c)}")
    print(f"Edge Connectivity for {i} component: {edge_connectivity}") 

OUTPUT

Subgraph size for 0 component: 200
Edge Connectivity for 0 component: 1
Subgraph size for 1 component: 2
Edge Connectivity for 1 component: 1

Despite the two subgraphs being very different to each other, both only needed to remove a single edge in order to disconnect the subgraph, completely.

Shortest path length:

The path length refers to the distance between two nodes in a network graph, calculated as the number of edges required to get from one node to the other. More specific information about a graph can be found using measures that build on the path length, such as the shortest path length, average shortest path length, and the shortest path length from the node in question, to all reachable nodes.

For illustration, here is how we can find out the shortest path of our Breast Cancer Network graph, from node ACTG1 to AGTR2 - the top, least-connected (in terms of degree centrality) nodes in the network. The output is the sequence of nodes along this shortest path.

PYTHON

my_shortest_paths = nx.all_shortest_paths(BC_g, 
                                          source="ACTG1", 
                                          target="AGTR2")

for path in my_shortest_paths:
    print(path)

OUTPUT

['ACTG1', 'ACTL6B', 'H3C12', 'AGTR2']
['ACTG1', 'CDH1', 'H3C12', 'AGTR2']
['ACTG1', 'ARID1A', 'H3C12', 'AGTR2']
['ACTG1', 'ARID2', 'H3C12', 'AGTR2']
['ACTG1', 'SMARCA4', 'H3C12', 'AGTR2']
['ACTG1', 'SMARCD1', 'H3C12', 'AGTR2']
['ACTG1', 'PBRM1', 'H3C12', 'AGTR2']
['ACTG1', 'AKT1', 'H3C12', 'AGTR2']
['ACTG1', 'MYC', 'H3C12', 'AGTR2']
['ACTG1', 'ARID1B', 'H3C12', 'AGTR2']

There are ten possible routes to get from node ACTG1 through to AGTR2: and it is via two intermediate nodes.

As a single summary quantity of a network, we can find the average shortest path length, as follows:

PYTHON

for c in nx.connected_components(BC_g): 
    print(nx.average_shortest_path_length(BC_g.subgraph(c)))

OUTPUT

2.0121608040201004
1.0

Assignment Questions:


End of chapter Exercises:

Using the code to create the (undirected) GZMB PPI network given at the start of this lesson.

  1. Experiment with the different layout algorithms, the node colour and shape, and the edge color.

Hint:

  • Using a pen and paper, draw a diagram of the PPI network.
  • Decide on the ordering of the nodes. One option is to start with the centre node, and subsequently number them in an anticlockwise direction. (Hint: There are 11 nodes in total.)
  • Label your pen and paper diagram with the node numbers.
  • Work out all the edges in the graph. Note the numbers of the node at the start of the edge and the node at the end of the edge. (Hint: There are 20 edges in total.)
  • Define a 11×11 matrix in Python using the NumPy zeros function.
  • For each edge, set the corresponding entry in the matrix to 1 (start node number corresponds to row number and end node number corresponds to column number in the adjacency matrix).
  • Define a cell array containing your node names and create a NetworkX object.
  • Experiment with the layout, colours, node shapes etc.

Remember to use the Python documentation, e.g. using help() if you have problems (alternatively use a search engine such as Google).

  1. Compute the number of edges in the generated network.

  2. Calculate the closeness centrality of each node in the network. Identify the top 5 nodes in the network.

  3. Plot the degree distribution of the network with inserted network graph.

  4. Calculate the Personalised PageRank starting from GZMB in this network.

Key Points

  • Concept of a network graph as a means of describing the interconnectivity between nodes, using edges.
  • Concept of directed and undirected networks.
  • Practice in creating network matrices, importing and handling network data.
  • Practice in quantifying network properties.
  • Practice in visualising networks.

Content from Network Applications


Last updated on 2025-02-10 | Edit this page

Estimated time: 120 minutes

Download Chapter notebook (ipynb)

Download Chapter PDF

Mandatory Lesson Feedback Survey

Overview

Questions

  • How can a meaningful biological network be constructed?
  • How can networks be analysed to answer biological questions?
  • How can we interpret the results of network analyses in a biological context?

Objectives

  • Reviewing data import and network quantification
  • Network construction based on biological question of interest
  • Selecting and using appropriate analyses, based on a research question

PYTHON

import networkx as nx
import numpy as np
import pandas as pd

from matplotlib.pyplot import subplots, show
import matplotlib.pyplot as plt

Importing data into NetworkX


In the previous lesson, we explored the basics of NetworkX, creating and analysing various networks and their properties; we also practised these techniques using some named examples, including the Breast Cancer Network dataset. As previously demonstrated, using NetworkX with data of your own requirest that you import these data in an appropriate, and compatible manner. Specifically, this requires knowledge surrounding importing data from a multitude of different data formats. We will give an overview of a few of these formats, here.

CSV Format

A frequently-encountered file format is CSV, whose .csv file extension is an abbreviation that stands for comma-separated values. In Networks 1, we used this file format to import the Breast Cancer Network as an edgelist. In this lesson, we will explore reading this data into NetworkX as an adjacency matrix.

The network data file ‘breastcancermatrix.csv’ comprises a large adjacency matrix of containing values of 0 and 1, with no further information included. The ‘breastcancerlabel.csv’ file, however, contains the list of neuron names, which we use to label the nodes of the network we will create. We will how this import works, in greater detail. While there are a multiple different methods we can use to import a CSV file, in this example, we will make use of the Pandas function read_csv.

The following code will import the entire Breast Cancer Network, as an adjacency matrix:

PYTHON

BC_matrix = pd.read_csv('Data/breastcancermatrix.csv', 
                    header = 0, 
                    index_col = 0,
                    dtype = "int")

BC_matrix = BC_matrix.to_numpy()

print(len(BC_matrix))

OUTPUT

202

This imports the CSV file, as before. The keyword argument header allows us to sepcify that Pandas should treat the first row (index 0) of data as a header (with name and textual information, as opposed to this being the first row of data). Thus, Pandas knows to parse or handle the second row of the file as the first row of actual data. The code then goes on to specify that the adjacency matrix is of data type int, indicating that integers are expected (although this does not necessarily have to be the case, for example, in a weighted network where weight data may also be present). The code then goes on to convert the Breast Cancer Network adjacency matrix from the newly-created Pandas DataFrame into a NumPy array, using the .to_numpy() method.

In a similar fashion, the labels for the Breast Cancer Network nodes (BC_Names) should be imported. They will, however, need to be converted from a Pandas DataFrame into a dictionary, which is the format required by NetworkX for node labels:

PYTHON

BC_Names = pd.read_csv('data/breastcancerlabel.csv', 
                    header=None)

BCnodeNames = BC_Names.to_dict()

BCLabels = BCnodeNames[0]

print(BCLabels)

OUTPUT

{0: 'ACTG1', 1: 'ACTL6B', 2: 'CDH1', 3: 'CTNNA1', 4: 'ARID1A', 5: 'ARID2', 6: 'SMARCA4', 7: 'AFDN', 8: 'FLNA', 9: 'AKT2', 10: 'SMARCD1', 11: 'PBRM1', 12: 'RHOA', 13: 'AKT1', 14: 'MYC', 15: 'ARID1B', 16: 'NTRK3', 17: 'RUNX1', 18: 'CBFB', 19: 'CCND1', 20: 'DNMT3A', 21: 'H3C12', 22: 'RB1', 23: 'ACVR1B', 24: 'KRAS', 25: 'PDGFRA', 26: 'SMAD3', 27: 'SMAD4', 28: 'TGFBR2', 29: 'BRCA2', 30: 'ALK', 31: 'CTNNB1', 32: 'EGFR', 33: 'KIT', 34: 'GRIA2', 35: 'USP9X', 36: 'MEN1', 37: 'EPHA7', 38: 'KMT2A', 39: 'NRAS', 40: 'HRAS', 41: 'AFF2', 42: 'CBL', 43: 'ATRX', 44: 'AGTR2', 45: 'XBP1', 46: 'AURKA', 47: 'CDKN1B', 48: 'MLH1', 49: 'MSH2', 50: 'FOXA1', 51: 'RARA', 52: 'NOTCH2', 53: 'APC', 54: 'MDM2', 55: 'PIK3C3', 56: 'KMT2C', 57: 'AXIN1', 58: 'CREBBP', 59: 'CCNE1', 60: 'EP300', 61: 'NOTCH3', 62: 'PIK3CA', 63: 'CDK6', 64: 'PMS2', 65: 'DAXX', 66: 'ERBB3', 67: 'NCOR1', 68: 'TP53', 69: 'ERBB2', 70: 'CDKN2B', 71: 'ATM', 72: 'PREX2', 73: 'TSC1', 74: 'KMT2D', 75: 'RPTOR', 76: 'MET', 77: 'STK11', 78: 'DNMT3B', 79: 'TBK1', 80: 'GRIN2A', 81: 'ZFP36L1', 82: 'FOXO3', 83: 'MYB', 84: 'ERBB4', 85: 'ATR', 86: 'NF2', 87: 'GATA2', 88: 'RET', 89: 'NF1', 90: 'CASP8', 91: 'NEK2', 92: 'CTSS', 93: 'MCL1', 94: 'PRDM1', 95: 'GNAS', 96: 'PTEN', 97: 'NCOA3', 98: 'IRS4', 99: 'CCND3', 100: 'ASXL1', 101: 'PIK3CD', 102: 'KDM6A', 103: 'WT1', 104: 'GATA3', 105: 'CHEK2', 106: 'MAP3K1', 107: 'SETD2', 108: 'MITF', 109: 'FGFR1', 110: 'ESR1', 111: 'MAP2K4', 112: 'MST1', 113: 'BAP1', 114: 'CDKN2A', 115: 'BRCA1', 116: 'BRAF', 117: 'PIK3R1', 118: 'FBXO32', 119: 'NTRK1', 120: 'DICER1', 121: 'TET2', 122: 'FAS', 123: 'ITCH', 124: 'IGF1R', 125: 'FBXW7', 126: 'ARHGAP35', 127: 'PTPN11', 128: 'NOTCH1', 129: 'AKT3', 130: 'PIK3CB', 131: 'TP63', 132: 'ROS1', 133: 'PTPRD', 134: 'ETV6', 135: 'BUB1B', 136: 'TBL1XR1', 137: 'STAG2', 138: 'PALB2', 139: 'ZNF217', 140: 'PCLO', 141: 'SPEN', 142: 'SYNE1', 143: 'NCOA2', 144: 'MED12', 145: 'FGFR2', 146: 'CHD8', 147: 'CTCF', 148: 'CASZ1', 149: 'PHF6', 150: 'SF3B1', 151: 'CUX1', 152: 'ATAD2', 153: 'BRIP1', 154: 'ZMYM3', 155: 'H2BC5', 156: 'SPOP', 157: 'ERCC4', 158: 'CCNI', 159: 'TNFAIP3', 160: 'BCOR', 161: 'DIS3', 162: 'USH2A', 163: 'PAX2', 164: 'TRAF5', 165: 'MAP3K13', 166: 'IRAK4', 167: 'ERG', 168: 'FOXP1', 169: 'EPHB1', 170: 'PTPN22', 171: 'CBLB', 172: 'DDR2', 173: 'KLF6', 174: 'CD3EAP', 175: 'FRMPD2', 176: 'MED23', 177: 'DCAF4L2', 178: 'TBX3', 179: 'CIC', 180: 'CNOT3', 181: 'HRNR', 182: 'RBMX', 183: 'HLA-DRB1', 184: 'GPS2', 185: 'DNAH11', 186: 'RPGR', 187: 'LGALS1', 188: 'EXOC2', 189: 'LRIG2', 190: 'EPHA6', 191: 'EXT2', 192: 'FH', 193: 'SHQ1', 194: 'GPR32', 195: 'PTGFR', 196: 'GPRIN2', 197: 'RYR2', 198: 'PPM1L', 199: 'OR2D2', 200: 'OR6A2', 201: 'PARP4'}

As covered in the Basic Python lessons, dictionaries are associative arrays, containing entries comprising pairs of both keys and corresponding values. In the code block given above, we are converting the BCnodeNames DataFrame into a dictionary using the .to_dict() method. The numerical node (protein) identifiers are the keys, and the corresponding node (protein) names are the values. Note that the to_dict() method wraps the dictionary within a second, containing dictionary. Therefore, in order to obtain the plain labels dictionary alone, we must index into it, using index ‘0’; this is subsequently stored into the BCLabels variable for easy handling, in subsequent code.

Next, let’s create a graph, specify a layout, and plot the network:

PYTHON

BCGraph = nx.from_numpy_array(BC_matrix)

BCLayout = nx.random_layout(BCGraph, seed=123)

nx.draw_networkx(BCGraph, 
                BCLayout,
                node_size=100,
                labels = BCLabels)

show()

At first glance, the resulting network graph appears to be quite complex and chaotic. Although you couldn’t tell by looking at this, the Breast Cancer Network contains two nodes that are isolated from the bigger module of the network. Using the random layout, we cannot see these isolated nodes. In this particular case, we also have other metadata that can be used for visualisation and analysis. We have a second file - ‘BCpositions2D.csv’ - which contains information on how the nodes relate to each other in a 2-dimensional space. We can include this information to replace the layout we visualised in the code chunk, above.

In the block of code given below, we can also adjust the plotting size frame to enlarge the plotting area for the network. We can also change the edge colour to make it transparent, making it far easier for us to interpret the network structure. The resulting graph shows, much more clearly, the two isolated nodes, as well.

PYTHON

import matplotlib.pyplot as plt
BC2DPos = pd.read_csv('Data/BCpositions2D.csv', 
                   header=0,
                   index_col=0)

BC2DPositions = BC2DPos.values

fig = plt.figure(1, figsize=(20, 20))
nx.draw(BCGraph, 
        BC2DPositions,
        node_size=1500,
        edge_color="#20808080",
        labels = BCLabels)

show()

Using this new 2-D visualisation method, it is clear that these two isolated proteins - OR6A2 and OR2D2 (top right) - stand alone, away from the rest of the proteins in the Breast Cancer Network. For context, OR6A2 and OR2D2 encode part of the olfactory receptor.

Olfactory receptor proteins are members of a large family of G-protein-coupled receptors arising from single, coding exon genes. Upon binding, they activate a signal transduction pathway that results in the generation of an electrical signal in olfactory sensory neurons, ultimately leading to the perception of smell in the brain. Although, a few studies have previously shown that olfactory receptor binding can also activate pathways associated to survival and proliferation of cancer cells (including MAPK, Rho and AKT signalling cascades), olfactory receptors are frequently neglected in cancer genomic and transcriptomic studies. It is assumed that their specialised role in the olfactory epithelium makes them unlikely to contribute to cancer development. Such negligence is also the likely explanation of the isolation and disconnection of OR6A2 and OR2D2 from the main functional module of the Breast Cancer Network.

PYTHON

BC3DPos = pd.read_csv('Data/BCpositions3D.csv', 
                   header=0,
                   index_col=0)

## Code to convert positions to dictionary
BC3DPositions = BC3DPos.T.to_dict('list')

## Code to assign node and edge position
nodes = np.array([BC3DPos.loc[v].values for v in BCGraph])
edges = np.array([(BC3DPos.loc[u].values, BC3DPos.loc[v].values) for u, v in BCGraph.edges()])

Note: the block of code below may take some time to run, so please be patient.

PYTHON

from matplotlib import animation
from IPython.display import HTML
def init():
    ax.scatter(*nodes.T,
               alpha=0.2,
               s=80,
               color="blue")
    for vizedge in edges:
        ax.plot(*vizedge.T,
                color="#05808080")
    ax.grid(False)
    ax.set_axis_off()
    plt.tight_layout()
    return


def _frame_update(index):
    ax.view_init(index * 0.2, index * 0.5)
    for key in BC3DPositions:
        ax.text(BC3DPositions[key][0],
                BC3DPositions[key][1],
                BC3DPositions[key][2],
                s=BCLabels[key])
    return


fig = plt.figure(1, figsize=(10, 10))
ax = fig.add_subplot(111,
                     projection="3d")
ax.grid(False)
ax.set_axis_off()

ani = animation.FuncAnimation(
                            fig,
                            _frame_update,
                            init_func=init,
                            interval=100,
                            cache_frame_data=False,
                            frames=25,
)

HTML(ani.to_jshtml())

In the resulting 3-dimensional plot displayed above, we grouped proteins that are enriched in the cellular membrane, cytoplasm and those usually associated around the nucleus. Using this visualization method, we are able to determine that most interactions are associated with proteins across different cellular compartments, rather than within the same compartments (i.e. less interactions across membrane-bound proteins).

We can also see that most proteins are grouped into the nucleus, rather than at the membrane. Plots like this, when combined with the Personalised PageRank simulation discussed in the previous lesson, may help us envision how topological results project onto a constructed network, taking into consideration the position of a node. ## The Network Repository - List of Edges

Network files in the Network Repository are sourced from publications, and designated a ‘.edges’ file extension. The files themselves are plain text files. Let’s have a look at the network of a mouse visual cortex.You can either download the zip file from the database, or use the file provided with this lesson.

Place the file ‘bn-mouse_visual-cortex_1.edges’ in your working directory. It can be helpful to first open it in a simple text editor to see what the data look like. In this case, it is a list of two numbers per row, separated by a space. This is the list of (directed) edges. The first number indicating ‘from’ or source, the second ‘to’ or target. Nodes are not given explicitly, but are inferred from the indices.

Being a list of edges, we can import this into NetworkX as an edgelist, specifying that the nodes are given as integers. The result is a standard graph object which we can plot using the NetworkX function, draw.

PYTHON

MouseCortex = nx.read_edgelist("Data/bn-mouse_visual-cortex_1.edges", nodetype=int)

MouseCortexLayout = nx.random_layout(MouseCortex, seed=111)

nx.draw(MouseCortex, 
        MouseCortexLayout,
        node_size=1000,
        node_color='r',
        with_labels=True)

show()

To interpret this network, you can check it against the source MOUSE-VISUAL-CORTEX-1.

JSON Format:

JSON (JavaScript Object Notation) is an open standard file format that is language-independent and widely used. Here is some example code, where we import and display the network that contains co-occurances of characters in Victor Hugo’s novel ‘Les Misérables’.

PYTHON

import json

with open('Data/miserables.json', 'r') as myfile:

    data=myfile.read()


miserables = json.loads(data)

miserablesNetwork = nx.json_graph.node_link_graph(miserables)

miserablesNetworkLayout = nx.circular_layout(miserablesNetwork)

fig, ax = subplots(figsize=(12,12))

nx.draw(miserablesNetwork, miserablesNetworkLayout,
        node_size=3000,
        with_labels=True)

fig.tight_layout()

show()

The ‘data’ variable contains the character names as node names. If you are curious, open the file with a text editor in order to manually check the entries, yourself.

For example, our file starts like this:

PYTHON

{

  "nodes": [

    {"id": "Myriel", "group": 1},

OUTPUT

'[' was never closed (<string>, line 3)

The syntax - with curly brackets and colons - immediately suggests the use of associative arrays (dictionary), once the data are within Python. The function node_link_graph provides an interface to handle the data as a NetworkX graph.

There are many other file formats that can contain network data. For a list of NetworkX functions that allow import and export using other file formats, see reading and writing graphs. ## Random Network Generation

NetworkX has a number of methods for creating matrices with given specifications. In the previous lesson, we have seen that use of NumPy’s .zeros function in an example such as .zeros((5,5)) can be used to create a 5 × 5 matrix filled with zeroes. This speeds up the creation of a network matrix with only a few non-zero entries. Similarly, to approach this from another angle, NumPy’s function .ones can be used to create a network matrix where most entries are 1, with few modifications required to create zero entries.

However, to test code for building a network it is often useful to be able to easily create test matrices, without having to fill in information about the edges, manually. For instance, it is not unusual to create an arbitrary number of realisations quickly, in order to study the distribution of network properties.

The code below, shows a fast way to create matrices with randomly assigned edges, using the function randint.

PYTHON

from numpy.random import randint

nodes = 5

rm = randint(0, 2, size=(nodes, nodes))

print(rm)

OUTPUT

[[0 1 1 0 0]
 [0 1 0 1 1]
 [1 0 0 0 1]
 [1 0 1 0 1]
 [1 0 0 0 1]]

The function randint, from the NumPy module random, is used to create an array or matrix filled with integers.

The first two arguments - two integers - specify which integers to use. The first number is the smallest integer, the second number is the largest integer plus one. The first two numbers work as (a,b+1): thus, this function will produce numbers (N) in the range specified as: a<=N<=b. In our case entering (0, 2) will produce zeroes and ones. With (1, 10), all integers from 1 to 9 will be used, all with equal probability.

The keyword argument size specifies the dimensions of the matrix. In our case, we want a nodes × nodes matrix for a specified number of nodes. The output of the code will look different at each function call, as each time you execute the code, Python will assign the zeroes and ones, randomly.

There are many ways to create network matrices with different specifications. These can often be used to test null hypotheses about experimental data. For example, networks with the same amount of nodes and edges can be generated as you might expect in an experimental network, but with the random assignment of edges in order to test whether the observed connections are likely to be due to chance. Some biological networks seem to have the so-called small world property where, in spite of relatively few connections, there are often fast ways to get from one node to any other node by tracing a path along combinations of edges.

There is a way to have Python return the same random numbers when using functions like randint, by using the seed function (also abailable in the random module). Reproducibility is pivotal in coding, and functions such as this facilitate it.

The command seed(1), for example, sets the value of the ‘seed’, used to initialise the random number generator, to 1. The choice of seed value is arbitrary. The important thing to note is that specifying a value will mean that the random number results are reproducible.

When we specify a starting seed value for the random number generator, it still produces random numbers; but will produce the same set of random numbers each time the code is executed. In order to test this, try running the following example:

PYTHON

from numpy.random import seed

seed(1)

rm1 = randint(0, 2, size=(nodes, nodes))

print(rm1)

OUTPUT

[[1 1 0 0 1]
 [1 1 1 1 0]
 [0 1 0 1 1]
 [0 0 1 0 0]
 [0 1 0 0 1]]

And try repeating it, again:

PYTHON

seed(1)

rm2 = randint(0, 2, size=(nodes, nodes))

print(rm2)

OUTPUT

[[1 1 0 0 1]
 [1 1 1 1 0]
 [0 1 0 1 1]
 [0 0 1 0 0]
 [0 1 0 0 1]]

Since we set the same, arbitrary seed for both calls of the randint function, the same, identical pattern of 0s and 1s is produced, each time. Verified with:

PYTHON

rm1 == rm2

OUTPUT

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

Bipartite Networks


In the previous lesson, we saw an example of a bipartite network in one of the Practise Exercises, but we didn’t load it into the Python environment. Bipartite networks are handled differently by NetworkX (and other Python network packages), in comparison to ordinary networks, as the nodes have different properties. Let’s try and generate a bipartite network from the previous lesson:

PYTHON

Network2 = nx.Graph()

# Add nodes with the node attribute "bipartite"
Network2.add_nodes_from(['A','C','E','F','G'], bipartite=0)
Network2.add_nodes_from(['B','D'], bipartite=1)

# Add edges only between nodes of opposite node sets
Network2.add_edges_from([('A','B'),
                         ('C','B'),
                         ('F','B'),
                         ('G','B'),
                         ('C','D'),
                         ('E','D'),
                         ('F','D'),
                         ('G','D')
                          ])

Here, we set up a bipartite network with 7 nodes. The 0 group has nodes ′A′,′C′,′E′,′F′,′G′, and the 1 group has nodes ′B′,′D′. The specified edges link nodes from the two groups with one another, but not to any nodes within their own group. NetworkX has a specific .is_connected function to check that your nodes are connected:

PYTHON

nx.is_connected(Network2)

OUTPUT

True

This will return a Boolean value - either True or False. Be cautious however, as this only tests for connection, not whether your graph is truly bipartite. You can use the function nx.is_bipartite(Network2) from the submodule networkx.algorithms to test whether or not your network is bipartite. It will similarly return a Boolean, with True indicating a bipartite network.

PYTHON

nx.is_bipartite(Network2)

OUTPUT

True

We can then plot this bipartite graph, using a layout of your choice. If we set the keyword argument condition with_labels=True, the node names that we set earlier become node labels.

PYTHON

Network2Layout = nx.spiral_layout(Network2)

nx.draw(Network2, 
        Network2Layout,
        node_size=2000,
        with_labels=True)

show()

This might not look like a bipartite network; but if you check the edges that we set up earlier, you can see that no node has an edge with another node in each group, therefore qualifying this as a bipartite network. If we want it to look more like a classic bipartite network, we can make use of the attributes we set up earlier and the module networkx.algorithms in order to make a custom layout, and more clearly visualise the bipartite nature of this network graph:

PYTHON

groupzero = nx.bipartite.sets(Network2)[0]

bipartitePos = nx.bipartite_layout(Network2, groupzero)

nx.draw(Network2, 
        bipartitePos,
        node_size=2000,
        with_labels = True)

show()

Using the bipartite network convention of giving one group of nodes the attribute 0 and the other 1, means that you can use this to change other aspects of your graph: such as colour. In this example, we can use the attributes of the nodes to assign a colour to each group of nodes. The colour list can then be incorporated into the plot:

PYTHON

color_dictionary = {0: 'crimson', 1: 'pink'}

color_list = list()

for attr in Network2.nodes.data('bipartite'):

    color_list.append(color_dictionary[attr[1]])

print(color_list)

OUTPUT

['crimson', 'crimson', 'crimson', 'crimson', 'crimson', 'pink', 'pink']

PYTHON

nx.draw(Network2, 
        bipartitePos,
        node_size=2000,
        font_size=20,
        with_labels=True,
        node_color=color_list)

show()

A comprehensive introduction to bipartite networks and their application to gene-disease networks can be found in section 2.7 of the online textbook Network Science by A. L. Barabási.

Review of Network Properties

In the previous lesson, we learned a few ways of quantifying node properties using centrality scores. We also learnt about Personalised PageRank and applied this onto the Breast Cancer Network.

Let’s take a look at how we can incorporate these network analysis results to our network visualization, using the ‘miserablesNetwork’ again, as an example:

PYTHON

# Firstly, define a function to draw node colours by centrality scores:

import matplotlib.colors as mcolors
%matplotlib inline

def draw(G, pos, measures, measure_name):
    
    nodes = nx.draw_networkx_nodes(G, pos, node_size=250, cmap=plt.cm.plasma, 
                                   node_color=list(measures.values()),
                                   nodelist=measures.keys())
    nodes.set_norm(mcolors.SymLogNorm(linthresh=0.01, linscale=1, base=10))
    # labels = nx.draw_networkx_labels(G, pos)
    edges = nx.draw_networkx_edges(G, pos)

    plt.title(measure_name)
    plt.colorbar(nodes)
    plt.axis('off')
    plt.show()

OUTPUT

invalid syntax (<string>, line 4)

PYTHON

# We can then calculate the closeness centrality results in the Breast Cancer Network:
closeness_centrality = nx.closeness_centrality(miserablesNetwork)

# Adding together the position of the nodes using spring_layout:
pos = nx.spring_layout(miserablesNetwork, seed=675)

# We can then plot the network by centrality results:
draw(miserablesNetwork, 
     pos, 
     closeness_centrality,
     'Closeness Centrality')

OUTPUT

NameError: name 'draw' is not defined

PYTHON

## Similarly, we can do this for degree centrality:
draw(miserablesNetwork, 
     pos, 
    nx.degree_centrality(miserablesNetwork),
     'Degree Centrality')

OUTPUT

NameError: name 'draw' is not defined

Similarly, we can also change the size of the nodes based on their centrality scores.

For presentability and ease, let’s begin by defining our own function to do so:

PYTHON

def draw_size(G, pos, measures, measure_name):
    
    cent = np.fromiter(measures.values(), float)
    sizes = cent / np.max(cent) * 200
    nodes = nx.draw_networkx_nodes(G, pos, node_size=sizes, cmap=plt.cm.plasma, 
                                   node_color=list(measures.values()),
                                   nodelist=measures.keys())
    nodes.set_norm(mcolors.SymLogNorm(linthresh=0.01, linscale=1, base=10))
    # labels = nx.draw_networkx_labels(G, pos)
    edges = nx.draw_networkx_edges(G, pos)

    plt.title(measure_name)
    plt.colorbar(nodes)
    plt.axis('off')
    plt.show()

And plotting it for the centrality scores:

PYTHON

## Similarly, we could do this for degree centrality:
draw_size(miserablesNetwork, 
     pos, 
    nx.degree_centrality(miserablesNetwork),
     'Degree Centrality')

OUTPUT

NameError: name 'mcolors' is not defined

Network Analysis using Real-World Datasets


In the following section, you can choose one of the following examples:

  • Identification of key targets in cancer networks
  • Neural signal propagation in C. elegans

NOTE: For your Networks 2 assignment, there are two options available to you. Dependent on which of the above examples you choose, you can answer the question set at the end of the cancer network section, or that at the end of the C. elegans section.

Identification of Key Targets in Breast and Pancreatic Cancer Networks:


Protein-protein interaction (PPI) networks offer a powerful approach to understanding the complex molecular mechanisms underlying diseases, including cancers. We touched on these briefly in the first Networks lesson. By mapping the interactions between proteins, researchers can identify key nodes and pathways critical to cancer progression. This network-based analysis not only highlights potential biomarkers for early diagnosis, but also reveals novel therapeutic targets. In breast and pancreatic cancers, leveraging PPI networks can uncover specific proteins that, when targeted, may disrupt cancer cell survival and proliferation, paving the way for more effective and personalised treatment strategies.

In this section, we will explore the construction of breast and pancreatic cancer networks using the genes that have been previously identified in the Network of Cancer Genes (NCG). The PPI database used is STRING-db, a database of known and predicted protein-protein interactions.

To simplify the process, all interactions used in this chapter have been filtered for high confidence scores (>400) from STRING-db; we only extracted the expanded network from breast, pancreatic and non-small cell lung cancers. For a full network, please refer to the Homo sapiens PPI network in STRING-db.

Firstly, let’s read in the (mock) STRING database network. This data is significantly reduced in size, compared to the full dataset:

PYTHON

string_db = pd.read_csv("Data/STRING_db.csv")
string_db.head()

OUTPUT

  protein1 protein2  weight
0    PALB2    BRCA1     999
1    PALB2    BRCA2     999
2    PALB2    BRIP1     973
3    PALB2    KEAP1     964
4    PALB2      FAS     837

Let’s then convert the entire STRING database into an undirected and weighted graph:

PYTHON

STRING_network = nx.from_pandas_edgelist(string_db,
                               source = "protein1",
                               target = "protein2",
                               edge_attr = True)

We can then check if the network is directed and weighted, using the following code:

PYTHON

nx.is_directed(STRING_network)

OUTPUT

False

PYTHON

nx.is_weighted(STRING_network)

OUTPUT

True

And we can test the number of nodes and edges using the following code:

PYTHON

STRING_num_node = STRING_network.number_of_nodes()
STRING_num_edge = STRING_network.number_of_edges()

print("Number of nodes in STRING db:", STRING_num_node)

OUTPUT

Number of nodes in STRING db: 257

PYTHON

print("Number of edges in STRING db:", STRING_num_edge)

OUTPUT

Number of edges in STRING db: 1374

Extraction of a subnetwork with disease genes of interest:

The STRING database network contains a large number of nodes and edges which may not all be relevant to breast or pancreatic cancers (the two cancers we are looking at, in this lesson). To extract related edges we must first find out which proteins are relevant to breast or pancreatic cancers. We can do this using the list of genes curated from this paper. These gene lists are stored in the Cancer_GeneList.csv file which we will import as a Pandas DataFrame, using the following code:

PYTHON

genelist_df = pd.read_csv("Data/Cancer_GeneList.csv")
genelist_df.head()

OUTPUT

   entrez  ...                        method
0      71  ...                        MutSig
1      91  ...  HOTNET2, OncodriveFM, MutSig
2      91  ...                    Wood-based
3      92  ...  HOTNET2, OncodriveFM, MutSig
4     107  ...                    Wood-based

[5 rows x 7 columns]

From visualising the top few rows of the DataFrame, there are a few cancers that have been included in this data. To extract the breast or pancreatic cancer gene list, we will have to filter the DataFrame, as follows:

PYTHON

BC_genelist = genelist_df[genelist_df["primary_site"] == "breast"]["symbol"].tolist()

PYTHON

Pan_genelist = genelist_df[genelist_df["primary_site"] == "pancreas"]["symbol"].tolist()

To create the breast cancer network, we will have to extract the relevant interactions from the STRING database network using the NetworkX function nx.subgraph, with a list of the nodes of interest:

PYTHON

BC_network = STRING_network.subgraph(BC_genelist)

And similarly using the Pan_genelist we created for pancreatic cancer:

PYTHON

PC_network = STRING_network.subgraph(Pan_genelist)

Practise Exercise 1

  1. Create a small network from the STRING-db PPI using the following list of genes: ["CASP8","TP53","FAS","MYC"]

  2. Calculate the number of edges in this network.

Answer:

PYTHON

# 1. Create a small network from gene list
genelist = ["CASP8","TP53","FAS","MYC"]
small_network = STRING_network.subgraph(genelist)

PYTHON

# 2. Calculate number of edges in network
small_network.number_of_edges()

OUTPUT

3

Node centralities in the cancer networks:

Once we have extracted the networks, we can study the node properties, as we did, previously. This helps us to understand how important these nodes are, topologically, in the network.

One of the few things we have covered is degree centrality. Let’s calculate this for the breast cancer network, and plot this network with nodes coloured by degree centrality:

PYTHON

## Adding together the position of the nodes using spring layout
pos = nx.spring_layout(BC_network, seed=675)

## We can then plot the network by centrality results
draw(BC_network, 
     pos, 
     nx.degree_centrality(BC_network),
     'Breast Cancer Network Degree Centrality')

OUTPUT

NameError: name 'draw' is not defined

Practise Exercise 2

  1. Calculate the closeness centrality of each node in the pancreatic cancer network.

  2. Plot the pancreatic cancer network using the spring_layout and draw the node colour based on the closeness centrality.

Answer:

PYTHON

# 1. Calculate the closeness centrality of each node in the pancreatic cancer network.
PC_closeness = nx.closeness_centrality(PC_network)

PYTHON

# 2. Plot pancreatic cancer network with closeness centrality results
draw(PC_network, 
     nx.spring_layout(PC_network, seed=675), 
     PC_closeness,
     'Pancreatic Cancer Network Closeness Centrality')

OUTPUT

NameError: name 'draw' is not defined

Comparing the importance of TP53 in breast and pancreatic cancer networks:

The TP53 gene, often referred to as the “guardian of the genome”, plays a crucial role in cancer prevention by encoding the p53 protein;responsible for the regulation of cell division, and promoting DNA repair. When functioning correctly, p53 can induce cell cycle arrest, allowing time for DNA repair, or trigger apoptosis if the damage is irreparable. However, mutations in the TP53 gene are common in many cancers - including breast, pancreatic and lung cancers - leading to the loss of its tumor-suppressing abilities. These mutations can result in uncontrolled cell proliferation, contributing to tumor growth and metastasis. TP53 is therefore a significant focus in cancer research, with efforts aimed at restoring its normal function to inhibit cancer progression.

In this lesson, we will consider the importance of TP53 in the breast and pancreatic cancer networks. In particular, we are interested in comparing the TP53 connections between breast and pancreatic cancers. As the network constructed from STRING database is represented as an undirected network, the flow of signal (i.e. co-expression and regulation) between the connected proteins will always occur, bidirectionally. Our goal is, therefore, to explore the differences in TP53 signal propagation between the two types of cancer based on their network structure.

PPR from TP53:

In the previous lesson, we learned how personalised PageRank (PPR) can be used to study highly connected train stations, such as Kings Cross St. Pancras Station on the London Underground map. We then effectively applied the same algorithm to study genes that are strongly connected to, and therefore influenced by, EGFR in the Breast Cancer Network. In a similar way, we can also apply PPR to TP53, in order to understand which other proteins may be highly connected to the protein-protein interactions based on the weighted edges within the network.

To do this, we can start of with a personalisation list in the Breast cancer network:

PYTHON

BC_dict = {}
for i in BC_network.nodes():
    if i == "TP53":
        BC_dict[i] = 1
    else:
        BC_dict[i] = 0

Then, we can run personalised PageRank, as follows:

PYTHON

BC_TP53_ppr = nx.pagerank(BC_network,
                    personalization = BC_dict)

Results of this PPR are then stored in a dictionary. We can rearrange the results to show top proteins as a DataFrame, as follows:

PYTHON

BC_TP53_ppr_df = pd.DataFrame.from_dict(BC_TP53_ppr, orient = "index")

## Rearrange dataframe result 
BC_TP53_ppr_df.columns = ["TP53_PPR"]
BC_TP53_ppr_df = BC_TP53_ppr_df.sort_values(by='TP53_PPR', ascending=False)

print(BC_TP53_ppr_df.head(n = 10))

OUTPUT

          TP53_PPR
TP53      0.181315
EP300     0.025523
BRCA1     0.023955
CREBBP    0.020186
ESR1      0.020051
SMARCA4   0.019565
HIST1H3B  0.018125
CTNNB1    0.017180
MYC       0.015891
MDM2      0.015678

Practise Exercise 3

  1. Repeat the Personalised PageRank analysis, seeding from TP53, in pancreatic cancer.

  2. Rearrange the PPR results and identify the top 20 targets in the network.

Answer:

PYTHON

# 1. PPR from TP53 in pancreatic cancer
PC_dict = {}
for i in PC_network.nodes():
    if i == "TP53":
        PC_dict[i] = 1
    else:
        PC_dict[i] = 0
    
PC_TP53_ppr = nx.pagerank(PC_network,
                    personalization = PC_dict)

PYTHON

# 2. Rearrange results to obtain top 20 targets
PC_TP53_ppr_df = pd.DataFrame.from_dict(PC_TP53_ppr, orient = "index")
PC_TP53_ppr_df.columns = ["TP53_PPR"]
PC_TP53_ppr_df = PC_TP53_ppr_df.sort_values(by='TP53_PPR', ascending=False)

print(PC_TP53_ppr_df.head(n = 20))

OUTPUT

         TP53_PPR
TP53     0.229126
SMARCA4  0.054104
EP300    0.050519
CTNNB1   0.048529
SMAD3    0.042565
DAXX     0.037529
HDAC2    0.033861
ATM      0.033723
ARID1A   0.028497
PTEN     0.023989
BRCA2    0.023390
ATRX     0.021206
VHL      0.020524
CDKN2A   0.019953
SMAD4    0.019284
AXIN1    0.019268
FBXW7    0.018287
KMT2C    0.017271
TGFBR1   0.015744
KMT2D    0.015398

Shortest Paths between KRAS and TP53:

The KRAS gene is one of the most frequently-mutated oncogenes in cancer, playing a pivotal role in driving tumor growth and progression. Mutations in KRAS lead to the continuous activation of signaling pathways that promote cell proliferation and survival, contributing to the development of cancers such as lung, colorectal, and pancreatic cancers. When KRAS mutations co-occur with TP53 mutations, the effects can be particularly aggressive. Mutant KRAS can enhance the oncogenic activities of mutant TP53, leading to increased tumor growth, metastasis and chemoresistance. This interaction underscores the complexity of cancer biology and highlights the need for targeted therapies that can address the combined effects of these mutations.

In network biology, the concept of the shortest path can be incredibly useful for studying the interactions between KRAS and TP53. In this context, the shortest path refers to the minimal number interactions needed to connect two proteins within the cancer network. By identifying the shortest path between KRAS and TP53, we can uncover direct and indirect interactions that may influence cancer progression. Shortest path can be helpful in mapping out how mutant KRAS might affect the function of TP53, or vice versa, providing insights into the combined effects of these mutations on tumor growth and metastasis. Furthermore, understanding these pathways can aid in the development of targeted therapies that specifically disrupt these critical interactions, potentially leading to more effective cancer treatments.

Let’s try and quantify this using the shortest path analysis, with specific application to KRAS and TP53:

PYTHON

## Extract all the shortest paths between KRAS and TP53 in breast cancer network:
KRAS_TP53_BC = nx.all_shortest_paths(BC_network, 
                 source="KRAS", 
                 target="TP53",
                 weight = None)

KRAS_TP53_BC = list(KRAS_TP53_BC)

print(KRAS_TP53_BC)

OUTPUT

[['KRAS', 'PIK3R1', 'TP53']]

Note: we have specified the keyword argument weight here to have a value of None. This is because, although shortest path analysis can take into account edge weights, larger edge weights would be misinterpreted as meaning longer distance; this is because distance is calculated as the sum of the edge weights. In our network, the edge weights represent a higher confidence in interactions between the proteins, thus a stronger connection. In this scenario, the shortest path algorithm would therefore be evaluating the edge weight differently to the biological context.

To overcome this, we could summarise the edge weights through the NetworkX function nx.path_weight:

PYTHON

## Path weight for BC network
path_weight = {}
for path in KRAS_TP53_BC:
    pw = nx.path_weight(BC_network,
                tuple(path),
                "weight")
    path_weight[tuple(path)] = pw

## First convert the results into a dataframe
BC_pathweight = pd.DataFrame.from_dict(path_weight, orient = "index")
BC_pathweight.columns = ["weight"]

## Then calculate the mean value
BC_mean_pathweight = BC_pathweight["weight"].mean()
print(BC_mean_pathweight)

OUTPUT

1131.0

Similarly, let’s repeat the analysis in the pancreatic network:

PYTHON

## Extract the shortest paths between KRAS and TP53 in Pancreatic cancer network
KRAS_TP53_PC= nx.all_shortest_paths(PC_network, 
                 source="KRAS", 
                 target="TP53",
                  weight = None)
KRAS_TP53_PC = list(KRAS_TP53_PC)
KRAS_TP53_PC

OUTPUT

[['KRAS', 'PIK3CA', 'FLG', 'CTNNB1', 'SMAD3', 'TP53'], ['KRAS', 'PIK3CA', 'FLG', 'CTNNB1', 'AXIN1', 'TP53'], ['KRAS', 'PIK3CA', 'FLG', 'CTNNB1', 'SMARCA4', 'TP53'], ['KRAS', 'PIK3CA', 'FLG', 'CTNNB1', 'EP300', 'TP53'], ['KRAS', 'PIK3CA', 'FLG', 'CTNNB1', 'FBXW7', 'TP53'], ['KRAS', 'PIK3CA', 'FLG', 'CTNNB1', 'DAXX', 'TP53'], ['KRAS', 'PIK3CA', 'FLG', 'CTNNB1', 'PTEN', 'TP53'], ['KRAS', 'PIK3CA', 'FLG', 'CTNNB1', 'TP53BP2', 'TP53']]

PYTHON

## Calculate path weight for pancreatic cancer network
path_weight = {}
for path in KRAS_TP53_PC:
    pw = nx.path_weight(PC_network,
                tuple(path),
                "weight")
    path_weight[tuple(path)] = pw

## Calculate the average path weight
PC_pathweight = pd.DataFrame.from_dict(path_weight, orient = "index")
PC_pathweight.columns = ["weight"]
PC_mean_pathweight = PC_pathweight["weight"].mean()
print(PC_mean_pathweight)

OUTPUT

3506.25

The shortest path between KRAS and TP53 is significantly shorter in breast cancer, when compared to pancreatic cancer network, suggesting shorter regulations between the two proteins between the two networks. However, analysing the average path weight reveals that the pancreatic cancer network KRAS/TP53 axis has stronger interaction confidence, when compared to the breast cancer KRAS-TP53 axis. Why would that be the case?

We only calculated the path weight without considering the path length, and a longer path is always expected to result in a higher path weight. We therefore need to bring this into consideration in our code. Thus, path weight by length is calculated, as follows:

PYTHON

## Calculate path weight for breast cancer network
path_weight = {}
for path in KRAS_TP53_BC:
    pw = nx.path_weight(BC_network,
                tuple(path),
                "weight")
    path_length = len(path)-1 # path length is always equal to number of nodes - 1, ie., for path length in A-B-C will equalt to 2
    pw_by_length = pw/path_length
    path_weight[tuple(path)] = pw_by_length

## Calculate the average path weight
BC_pathweight = pd.DataFrame.from_dict(path_weight, orient = "index")
BC_pathweight.columns = ["weight"]
BC_mean_pathweight = BC_pathweight["weight"].mean()
print(BC_mean_pathweight)

OUTPUT

565.5

PYTHON

## Calculate path weight for pancreatic cancer network
path_weight = {}
for path in KRAS_TP53_PC:
    pw = nx.path_weight(PC_network,
                tuple(path),
                "weight")
    path_length = len(path)-1 # path length is always equal to number of nodes - 1, ie., for path length in A-B-C will equalt to 2
    pw_by_length = pw/path_length
    path_weight[tuple(path)] = pw_by_length

## Calculate the average path weight
PC_pathweight = pd.DataFrame.from_dict(path_weight, orient = "index")
PC_pathweight.columns = ["weight"]
PC_mean_pathweight = PC_pathweight["weight"].mean()
print(PC_mean_pathweight)

OUTPUT

701.25

By considering the path length in the calculation, we can now confirm that there is a higher confidence in the KRAS-TP53 interaction axis in the pancreatic network, when comparing this to the breast cancer network. Based on these networks, we can therefore speculate that regulating KRAS expression may have a bigger impact on TP53 expression in pancreatic cancer than in breast cancer.

In fact, KRAS mutations are highly prevalent in pancreatic cancer, particularly in pancreatic ductal adenocarcinoma (PDAC), where they occur in approximately 95% of cases. Conversely, KRAS mutations are less common in breast cancer; and targeting the downstream pathways activated by KRAS (such as the MAPK pathway) has shown more promise.

Practise Exercise 4

  1. Repeat the above analysis, but this time between BRCA1 to TP53, in the breast cancer network. What is the average shortest path weight?

  2. What is the average shortest path weight by path length?

Answer:

PYTHON

# 1. Shortest path between BRCA to TP53 in breast cancer
BRCA_TP53_BC= nx.all_shortest_paths(BC_network, 
                 source="BRCA1", 
                 target="TP53",
                  weight = None)
BRCA_TP53_BC = list(BRCA_TP53_BC)
BRCA_TP53_BC

OUTPUT

[['BRCA1', 'TP53']]

PYTHON

# 2. Calculate path weight for breast cancer network
path_weight = {}
for path in BRCA_TP53_BC:
    pw = nx.path_weight(BC_network,
                tuple(path),
                "weight")
    path_length = len(path)-1 # path length is always equal to number of nodes - 1, ie., for path length in A-B-C will equal to 2
    pw_by_length = pw/path_length
    path_weight[tuple(path)] = pw_by_length

## Calculate the average path weight
BC_BRCA_pathweight = pd.DataFrame.from_dict(path_weight, orient = "index")
BC_BRCA_pathweight.columns = ["weight"]
BC_BRCA_mean_pathweight = BC_BRCA_pathweight["weight"].mean()
print(BC_BRCA_mean_pathweight)

OUTPUT

993.0

Modulating the KRAS/TP53 axis:

Based on the shortest path analysis we demonstrated between KRAS and TP53 in the breast cancer network, we can clearly see that some of the nodes connecting KRAS and TP53 are also ranked highly in the TP53 PPR analysis. Thus, these nodes may have an important role to the KRAS/TP53 axis in the breast cancer network.

Let’s now explore the impact of removing these nodes to the shortest path connection, and the PPR results of the breast cancer network.

First let’s identify the top node in the KRAS/TP53 shortest path overlapping as the top node in the PPR result:

PYTHON

# Flatten the nested list in the shortest path analysis:
BC_SP_nodes = [item for sublist in KRAS_TP53_BC for item in sublist]
BC_SP_nodes = list(set(BC_SP_nodes))
BC_SP_nodes                           

OUTPUT

['TP53', 'PIK3R1', 'KRAS']

PYTHON

# From the list of nodes, remove KRAS and TP53, and extract their PPR scores:
BC_SP_PPR = {}
for node in BC_SP_nodes:
    if node in ["KRAS","TP53"]:
        continue # Skip the function
    else: 
        BC_SP_PPR[node] = BC_TP53_ppr[node]

# Convert results to dataframe 
BC_SP_PPR_df = pd.DataFrame.from_dict(BC_SP_PPR,
                                   orient = "index")

# Sort dataframe by PageRank result
BC_SP_PPR_df = BC_SP_PPR_df.sort_values(by=0, ascending=False)
BC_SP_PPR_df 

OUTPUT

               0
PIK3R1  0.012323

The top node is therefore (unsuprisingly) PIK3R1. Let’s remove it from the breast cancer network, and see how this impacts the network in in regards to the KRAS/TP53 PPR:

PYTHON

# Make a copy of the graph
BC_network_new = BC_network.copy()

# Remove a node from the copied graph
BC_network_new.remove_node("PIK3R1")

# Print the remaining nodes and edges
print("Nodes:", BC_network_new.number_of_nodes())

OUTPUT

Nodes: 168

PYTHON

print("Edges:", BC_network_new.number_of_edges())

OUTPUT

Edges: 865

Let’s repeat the PPR and shortest path analysis, as before:

PYTHON

KRAS_TP53_dict_new_BC = {}
for i in BC_network_new.nodes():
    if i  == "TP53":
        KRAS_TP53_dict_new_BC[i] = 1
    else:
        KRAS_TP53_dict_new_BC[i] = 0
BC_TP53_ppr_new = nx.pagerank(BC_network_new,
                    personalization = KRAS_TP53_dict_new_BC)
BC_TP53_ppr_new_df = pd.DataFrame.from_dict(BC_TP53_ppr_new, orient = "index")

## Sort by PPR result
BC_TP53_ppr_new_df.columns = ["New_PPR"]
BC_TP53_ppr_new_df = BC_TP53_ppr_new_df.sort_values(by='New_PPR', ascending=False)

print(BC_TP53_ppr_new_df.head(n = 10))

OUTPUT

           New_PPR
TP53      0.182037
EP300     0.026320
BRCA1     0.024468
CREBBP    0.020806
ESR1      0.020254
SMARCA4   0.020144
HIST1H3B  0.018656
CTNNB1    0.017451
MYC       0.016382
MDM2      0.015969

And explore top changes by nodes:

PYTHON

PPR_df = pd.merge(BC_TP53_ppr_df, 
                  BC_TP53_ppr_new_df, 
                  left_index=True, right_index=True, ## joining by index
                  how='inner') ## Makes sure only taking in common nodes
PPR_df.columns = ["PPR","New_PPR"]
PPR_df["Differences"] = PPR_df["New_PPR"] - PPR_df["PPR"]

PPR_df = PPR_df.reindex(PPR_df['Differences'].abs().sort_values(ascending=False).index)

top_changes = PPR_df.head(10).index.tolist()
print("Top 10 nodes with most signficant changes:", top_changes)

OUTPUT

Top 10 nodes with most signficant changes: ['EP300', 'TP53', 'CREBBP', 'SMARCA4', 'HIST1H3B', 'BRCA1', 'PIK3CB', 'PIK3CA', 'ATM', 'MYC']

And repeating the shortest path analysis:

PYTHON

KRAS_TP53_BC_new = nx.all_shortest_paths(BC_network_new, 
                 source="KRAS", 
                 target="TP53",
                  weight = None)
KRAS_TP53_BC_new = list(KRAS_TP53_BC_new)
KRAS_TP53_BC_new

OUTPUT

[['KRAS', 'NF1', 'HIST1H3B', 'TP53'], ['KRAS', 'EGFR', 'ESR1', 'TP53'], ['KRAS', 'PIK3CA', 'ESR1', 'TP53'], ['KRAS', 'EGFR', 'PTEN', 'TP53'], ['KRAS', 'HRAS', 'AURKA', 'TP53'], ['KRAS', 'BRAF', 'MAP3K1', 'TP53']]

PYTHON

# Calculate path weight for breast cancer network
path_weight = {}
for path in KRAS_TP53_BC_new:
    pw = nx.path_weight(BC_network_new,
                tuple(path),
                "weight")
    path_length = len(path)-1 # path length is always equal to number of nodes - 1, ie., for path length in A-B-C will equal to 2
    pw_by_length = pw/path_length
    path_weight[tuple(path)] = pw_by_length

## Calculate the average path weight
BC_BRCA_pathweight_new = pd.DataFrame.from_dict(path_weight, orient = "index")
BC_BRCA_pathweight_new.columns = ["weight"]
BC_BRCA_mean_pathweight_new = BC_BRCA_pathweight["weight"].mean()
print(BC_BRCA_mean_pathweight_new)

OUTPUT

993.0

Assignment Questions (Cancer Example):


End of chapter Exercises (Cancer Example):

NOTE: You have one of two options for this assignment. If you choose the Cancer example, please complete and submit this assignment only (and not the C. elegans nematode neural propagation example).

  1. Read in the STRING database network and the cancer gene list. Create a smaller network containing only the lung cancer proteins.

  2. Calculate the number of nodes and edges in the lung cancer network.

  3. Calculate the Personalised PageRank (PPR) of the lung cancer network from TP53. Which are the top 10 nodes by PageRank score?

  4. Calculate all shortest paths and their associated edge weights between KRAS and TP53. What is the average edge weight by path length of the shortest path?

  5. Based on the shortest path results and the PPR results, removing which node would have most impact on the KRAS/TP53 axis in this network? Please give your answer by generating a new network and calculating a new set of PPR results and shortest path result.

Neural Signal Propagation in Nematode Worms


Image reference.

Caenorhabditis elegans (C. elegans), a nematode worm, is a key model organism in biology due to its transparency and simplicity, making it ideal for visualisation and research. In this example, we are going to look at its neural network, or connectome. C. elegans was the first multicellular organism to have its genome fully sequenced, and remains the only one with a complete connectome, or neural map. Such understanding has proven pivotal in the advancement of human brain research. Additionally, the C. elegans connectome has also been used to study complex networks.

The network we are exploring in this section, is one of many connectomes provided by Witvliet et al., 2021 in studying C. elegans development. Of particular interest is comparing the connectome structures between young adult (YA) and initial larva stage (L1) nematodes.

The spread of signal between connected neurons always occurs in the same direction, and therefore these networks are instrinsically visualised as directed network graphs. An animal’s nervous system changes as its body grows - from birth through adulthood - and its behaviours mature accordingly. Thus, our aim is to explore the differences in signal propagation between these stages of the C. elegans connectome.

Reading in larval and young adult adjacency matrices:

The L1 stage in C. elegans refers to the first larval stage after hatching. During this stage, the neural network of C. elegans is relatively simple but, at the same time, well-defined. We will read the L1 connectome, which is stored as an adjacency matrix, into a NetworkX object:

PYTHON

L1_am = pd.read_csv("Data/L1_adjacency.csv",
                    index_col=0)
L1_connectome = nx.from_pandas_adjacency(L1_am,
                                         create_using = nx.DiGraph())

Note that we have specified the algorithm using DiGraph() when converting the adjacency matrix to a NetworkX graph object using the .from_pandas_adjacency function. This ensures that the graph is directed, when read. We can check whether this is true using the NetworkX function .is_directed:

PYTHON

nx.is_directed(L1_connectome)

OUTPUT

True

Additionally, if you look carefully at the adjacency matrix, the edges are non-binary:

PYTHON

L1_am.head()

OUTPUT

          IL1L  BWM-DR06  IL2VR  IL2VL  ...  DB1  DA1  Fragment2  CANR
IL1L         0         0      0      0  ...    0    0          0     0
BWM-DR06     0         0      0      0  ...    0    0          0     0
IL2VR        0         0      0      0  ...    0    0          0     0
IL2VL        0         0      0      0  ...    0    0          0     0
BWM-VL05     0         0      0      0  ...    0    0          0     0

[5 rows x 200 columns]

These numeric values represent the edge weights, and are assigned to reflect the number of neural connections (synapses or gap junctions) between the two nodes. We can, again, check whether the weight is retained in the NetworkX graph object using the function .is_weighted():

PYTHON

nx.is_weighted(L1_connectome)

OUTPUT

True

Contrastingly, the young adult connectome represents the neural circuit that governs various behaviours - including locomotion, feeding, mating and sensory responses. This connectome allows the young adult nematode to move more efficiently in line with its environment, with sensory neurons helping it to respond to external stimuli. Let’s read in the young adult (YA) adjacency matrix into a NetworkX graph object:

PYTHON

YA_am = pd.read_csv("Data/YA_adjacency.csv",
                    index_col=0)
YA_connectome = nx.from_pandas_adjacency(YA_am,
                                        create_using = nx.DiGraph())

Practise Exercise 5

  1. How many nodes and edges are there in the L1 network?

  2. How many nodes and edges are in the YA network?

  3. Using the hint from the L1 network above, test whether the YA connectome is weighted.

  4. Similarly, test whether the YA network is directed.

Answer:

PYTHON

# 1. Number of nodes and edges in the L1 network:
L1_num_node = L1_connectome.number_of_nodes()
L1_num_edge = L1_connectome.number_of_edges()

print("Number of nodes in L1:", L1_num_node)

OUTPUT

Number of nodes in L1: 200

PYTHON

print("Number of edges in L1:", L1_num_edge)

OUTPUT

Number of edges in L1: 9639

PYTHON

# 2. Number of nodes and edges in the YA network:
YA_num_node = YA_connectome.number_of_nodes()
YA_num_edge = YA_connectome.number_of_edges()

print("Number of nodes in YA:", YA_num_node)

OUTPUT

Number of nodes in YA: 216

PYTHON

print("Number of edges in YA:", YA_num_edge)

OUTPUT

Number of edges in YA: 12561

PYTHON

# 3. Test if YA connectome is weighted:
nx.is_weighted(YA_connectome)

OUTPUT

True

PYTHON

# 4. Test if YA connectome is directed:
nx.is_directed(YA_connectome)

OUTPUT

True

Node centralities in the C. elegans connectome:

Unlike the network we saw in the previous chapter, the C. elegans connectome is directed. It may, therefore, be important to consider the direction of the edges when we calculate the degree centrality of each neuron. Specifically, the number of edges that are flowing in could be calculated as in-degree using the function in_degree:

PYTHON

L1_indegree = L1_connectome.in_degree()
print(L1_indegree)

OUTPUT

[('IL1L', 53), ('BWM-DR06', 11), ('IL2VR', 29), ('IL2VL', 34), ('BWM-VL05', 13), ('CEPVR', 64), ('CEPVL', 56), ('OLQVL', 50), ('BWM-VL06', 14), ('SAADR', 68), ('RIH', 83), ('RIR', 80), ('SMBVR', 77), ('AIAL', 33), ('BDUL', 38), ('BDUR', 35), ('RID', 70), ('RMED', 65), ('RIGR', 67), ('AIYR', 52), ('SIAVR', 59), ('SIAVL', 57), ('SIADL', 60), ('SMBDL', 78), ('SMBDR', 77), ('AIBR', 89), ('RIMR', 68), ('RIBR', 59), ('RICR', 90), ('AVBR', 67), ('SAADL', 67), ('AVL', 54), ('BWM-VL07', 4), ('BWM-VR01', 26), ('BWM-VR04', 18), ('BWM-VR02', 22), ('BWM-VR03', 20), ('BWM-VL02', 21), ('BWM-VL01', 20), ('BWM-VL03', 18), ('AWCL', 57), ('AWBL', 45), ('ASIL', 31), ('AWAL', 34), ('URADL', 35), ('IL1DR', 54), ('URYDR', 61), ('OLQDR', 54), ('URYDL', 75), ('IL1DL', 49), ('URBL', 54), ('OLQDL', 62), ('URYVL', 50), ('URAVL', 36), ('URAVR', 36), ('IL1VL', 38), ('IL1VR', 36), ('OLQVR', 48), ('RIPR', 29), ('URYVR', 54), ('RMER', 50), ('BWM-VR05', 12), ('BAGR', 75), ('BAGL', 63), ('BWM-DL05', 16), ('BWM-DL06', 19), ('RMDVR', 81), ('RMEL', 49), ('RMDVL', 72), ('SMDVR', 75), ('AVAR', 70), ('SAAVR', 68), ('RMDR', 65), ('RMEV', 82), ('CEPDR', 52), ('CEPDL', 53), ('AVAL', 73), ('RMDL', 69), ('URXL', 66), ('AVEL', 79), ('AFDL', 32), ('ADFL', 38), ('SMDDR', 80), ('SIBVL', 97), ('SMDDL', 74), ('AWBR', 43), ('ASHR', 52), ('ASKL', 33), ('RIAL', 63), ('SMDVL', 73), ('RMDDR', 62), ('AWCR', 53), ('ADFR', 36), ('AVER', 84), ('SIBDR', 85), ('RMDDL', 69), ('RIAR', 69), ('AFDR', 28), ('SAAVL', 69), ('ASKR', 30), ('SIBVR', 88), ('ADLR', 35), ('AWAR', 35), ('AIBL', 97), ('ASHL', 56), ('AUAL', 40), ('AUAR', 51), ('AIAR', 34), ('SIADR', 68), ('ASGR', 19), ('ASER', 43), ('ADLL', 36), ('AVHR', 58), ('ASGL', 20), ('ASJL', 39), ('RIBL', 70), ('ASEL', 46), ('AVBL', 76), ('BWM-DR07', 9), ('AVJR', 64), ('RIVR', 74), ('AVDR', 85), ('RIVL', 75), ('AVJL', 64), ('AIMR', 46), ('AIZR', 59), ('AVHL', 56), ('RIML', 69), ('FLPR', 21), ('RIS', 78), ('AVDL', 79), ('RICL', 96), ('AINR', 55), ('AIML', 42), ('RIFR', 42), ('RIFL', 45), ('ADER', 28), ('AIYL', 43), ('FLPL', 27), ('AIZL', 57), ('AINL', 56), ('ADEL', 25), ('AVKL', 54), ('AVKR', 53), ('ADAR', 74), ('ADAL', 75), ('RMGR', 60), ('RMGL', 62), ('RIGL', 61), ('SIBDL', 87), ('ASJR', 40), ('ALMR', 38), ('ALML', 28), ('BWM-VL08', 5), ('BWM-DR04', 20), ('BWM-DR02', 17), ('BWM-DL02', 21), ('IL1R', 49), ('BWM-DR01', 18), ('URBR', 45), ('RIPL', 30), ('URXR', 59), ('BWM-DL01', 25), ('SMBVL', 76), ('BWM-VR06', 13), ('BWM-VR08', 8), ('BWM-DR03', 21), ('BWM-DR05', 15), ('BWM-DR08', 6), ('BWM-DL08', 9), ('BWM-DL03', 18), ('BWM-VR07', 11), ('URADR', 36), ('OLLL', 57), ('IL2L', 47), ('BWM-DL04', 15), ('PVQR', 50), ('PVPR', 41), ('ALA', 48), ('IL2DL', 35), ('BWM-VL04', 14), ('ASIR', 33), ('OLLR', 63), ('IL2DR', 43), ('IL2R', 45), ('PVT', 59), ('PVPL', 46), ('PVQL', 31), ('PVCL', 39), ('DVC', 35), ('DVA', 92), ('PVCR', 49), ('PVR', 84), ('Fragment1', 8), ('excgl', 43), ('SABD', 18), ('DB1', 17), ('DA1', 13), ('Fragment2', 8), ('CANR', 3)]

The output is a list of tuples showing the neuron followed by its in-degree. We can convert this into a more readable format (such as a Pandas DataFrame), as follows:

PYTHON

L1_indegree_df = pd.DataFrame(list(dict(L1_indegree).items()), 
                              columns=['Node', 'In-Degree'])
L1_indegree_df

OUTPUT

          Node  In-Degree
0         IL1L         53
1     BWM-DR06         11
2        IL2VR         29
3        IL2VL         34
4     BWM-VL05         13
..         ...        ...
195       SABD         18
196        DB1         17
197        DA1         13
198  Fragment2          8
199       CANR          3

[200 rows x 2 columns]

And similarly for out-degree, which is the number of edges flowing out of the neuron node, using the function out_degree:

PYTHON

L1_outdegree = L1_connectome.out_degree()
L1_outdegree_df = pd.DataFrame(list(dict(L1_outdegree).items()), 
                              columns=['Node', 'Out-Degree'])
L1_outdegree_df

OUTPUT

          Node  Out-Degree
0         IL1L          53
1     BWM-DR06          11
2        IL2VR          29
3        IL2VL          34
4     BWM-VL05          13
..         ...         ...
195       SABD          18
196        DB1          17
197        DA1          13
198  Fragment2           8
199       CANR           3

[200 rows x 2 columns]

In the previous lesson, we also learned how to plot the network nodes based on their centrality scores. Similarly, we can plot the L1 network based on the in-degree and out-degree centralities. First, let’s load and define our plotting function:

PYTHON

import matplotlib.colors as mcolors
def draw_centrality(G, pos, measures, measure_name):
    
    nodes = nx.draw_networkx_nodes(G, pos, node_size=200, cmap=plt.cm.plasma, 
                                   node_color=list(measures.values()),
                                   nodelist=measures.keys())
    nodes.set_norm(mcolors.SymLogNorm(linthresh=0.01, linscale=1, base=10))
    # labels = nx.draw_networkx_labels(G, pos)
    edges = nx.draw_networkx_edges(G, pos)

    plt.title(measure_name)
    plt.colorbar(nodes)
    plt.axis('off')
    plt.show()

Then we can pass the in-degree results for L1 to this plotting function. Note that this function requires that this is formatted as a dictionary of keys and values:

PYTHON

plt.figure(figsize=(12, 8))
pos = nx.spring_layout(L1_connectome)
draw_centrality(L1_connectome, 
        pos, 
        dict(L1_indegree), 
        'In-degree')

And similarly for the out-degree centrality of L1:

PYTHON

plt.figure(figsize=(12, 8))
pos = nx.spring_layout(L1_connectome)
draw_centrality(L1_connectome, 
        pos, 
        dict(L1_outdegree), 
        'Out-degree')

Practise Exercise 6

  1. Calculate the in-degree in the YA network.

  2. Generate a network plot of the YA network with node colours based on the in-degree centrality.

  3. Calculate the out-degree in the YA network.

  4. Generate a network plot of the YA network with node colours based on the out-degree centrality.

Answer:

PYTHON

#1. In-degree of YA network.
YA_indegree = YA_connectome.in_degree()
YA_indegree_df = pd.DataFrame(list(dict(YA_indegree).items()), 
                              columns=['Node', 'In-Degree'])
YA_indegree_df

OUTPUT

         Node  In-Degree
0        RIMR         95
1        RIVL         87
2    BWM-DR01         24
3        AIML         54
4        IL1R         53
..        ...        ...
211      AVBR        101
212      AVEL         97
213      AVJR         76
214      AVBL         98
215      AVJL         81

[216 rows x 2 columns]

PYTHON

# 2. Plot YA network with nodes by in-degree

plt.figure(figsize=(12, 8))
pos = nx.spring_layout(YA_connectome)
draw_centrality(YA_connectome, 
        pos, 
        dict(YA_indegree), 
        'Young Adult Network In-Degree')

PYTHON

# 3. Out-degree of YA network.
YA_outdegree = YA_connectome.out_degree()
YA_outdegree_df = pd.DataFrame(list(dict(YA_outdegree).items()), 
                              columns=['Node', 'Out-Degree'])
YA_outdegree_df

OUTPUT

         Node  Out-Degree
0        RIMR          95
1        RIVL          87
2    BWM-DR01          24
3        AIML          54
4        IL1R          53
..        ...         ...
211      AVBR         101
212      AVEL          97
213      AVJR          76
214      AVBL          98
215      AVJL          81

[216 rows x 2 columns]

PYTHON

# 4. Plot YA network with nodes by out-degree

plt.figure(figsize=(12, 8))
pos = nx.spring_layout(YA_connectome)
draw_centrality(YA_connectome, 
        pos, 
        dict(YA_outdegree), 
        'Young Adult Network Out-Degree')

Comparing the in-degree changes between the L1 and YA network:

The number of in- and out-degrees is drastically different between the L1 and YA networks. It might, therefore, be interesting to perform a side-by-side comparison of the in- and out-degrees of common neurons across the two networks. We can easily do so by combining the DataFrames calculated above:

PYTHON

Indegree_df = pd.merge(L1_indegree_df, 
                     YA_indegree_df, 
                     on='Node', ## joining by node name
                     how='inner') ## Makes sure only taking in common nodes
Indegree_df.columns = ["Node","L1","YA"]
Indegree_df

OUTPUT

         Node  L1   YA
0        IL1L  53   56
1    BWM-DR06  11   25
2       IL2VR  29   33
3       IL2VL  34   38
4    BWM-VL05  13   25
..        ...  ..  ...
189       DVC  35   73
190       DVA  92  102
191      PVCR  49   47
192       PVR  84   77
193      CANR   3    2

[194 rows x 3 columns]

We can then calculate the differences between the columns:

PYTHON

Indegree_df["Differences"] = Indegree_df["L1"] - Indegree_df["YA"]

## Sort by differences in absolute difference value
Indegree_df = Indegree_df.reindex(Indegree_df['Differences'].abs().sort_values(ascending=False).index)

Indegree_df.head(10)

OUTPUT

     Node  L1   YA  Differences
136  ADER  28   79          -51
141  ADEL  25   76          -51
189   DVC  35   73          -38
29   AVBR  67  101          -34
70   AVAR  70  103          -33
147  RMGL  62   90          -28
26   RIMR  68   95          -27
146  RMGR  60   86          -26
143  AVKR  53   78          -25
76   AVAL  73   97          -24

Practise Exercise 6

  1. Calculate the differences in out-degrees between the L1 and YA network

  2. Which nodes experienced the greatest changes in out-degree centrality between the L1 and YA networks? Please list the top 10 nodes with the most significant changes.

Answer:

PYTHON

# 1. Calculate differences in out-degree
Outdegree_df = pd.merge(L1_outdegree_df, 
                     YA_outdegree_df, 
                     on='Node', ## joining by node name
                     how='inner') ## Makes sure only taking in common nodes
Outdegree_df.columns = ["Node","L1","YA"]
Outdegree_df["Differences"] = Outdegree_df["L1"] - Outdegree_df["YA"]

PYTHON

# 2. Get the top 10 most signficant changes:
## Sort by differences in absolute difference value
Outdegree_df = Outdegree_df.reindex(Outdegree_df['Differences'].abs().sort_values(ascending=False).index)

top_changes = Outdegree_df.head(10)["Node"].tolist()
print("Top 10 nodes with most signficant changes:", top_changes)

OUTPUT

Top 10 nodes with most signficant changes: ['ADER', 'ADEL', 'DVC', 'AVBR', 'AVAR', 'RMGL', 'RIMR', 'RMGR', 'AVKR', 'AVAL']

Comparing the locomotion neural circuit in L1 and YA networks:

Locomotion is crucial for C. elegans as it enables them to navigate their environment, find food and avoid harmful stimuli. During the larval stages, C. elegans nematodes exhibit both swimming and crawling behaviors, which are essential for their survival and development. These movements are characterised by rhythmic patterns of body postures, allowing the larvae to move efficiently despite their small size. As the larvae develop, their locomotor abilities become more coordinated and stable, coinciding with the maturation of their neural circuits. This progression in locomotion is vital for the larvae to transition successfully into adulthood, where they continue to rely on these behaviors for various activities.

Let’s explore how the nerual circuits of locomotion-related neurons between L1 and YA nematodes, using shortest path. Of particular interest:

PVCR (PVC-right) PVCL (PVC-left): the command interneuron primarily involved in forward locomotion in C. elegans. It collaborates with AVB interneurons to facilitate forward movement, allowing the worm to navigate its environment efficiently. PVCs also play a role in modulating responses to harsh touch stimuli to the tail, with increased calcium levels observed in these neurons following stimulation. This neuron forms synaptic connections with AVA neurons (AVAR and AVAL), enabling the coordination between forward and backward movements.

AVAR (AVA-right) and AVAL (AVA-left): the command interneurons in C. elegans that play a pivotal role in backward locomotion. These neurons - part of a bilaterally symmetric pair - collaborate to drive the worm’s reversal movements. They receive sensory input and integrate signals to coordinate backward movement, ensuring the worm can respond effectively to environmental stimuli. AVAR and AVAL also interact with other neurons involved in locomotion, such as AVD and AVE, to maintain smooth and coordinated movements.

Image reference.

In the figure above, the anatomical sites of PVC and AVA are shown. They are situated at either ends of the nematode. As C. elegans ages, the neural circuits undergo signficant changes to support the worm’s increasing complexity and behavioral needs. The maturation and refinement of the connectivity is, therefore, required to enhance the corrdination and efficiency of both forward and backward movement.

Let’s observe and explore the locomotion neural circuit in further detail.

PPR from AVA and PVC:

Let’s apply PPR to the AVA and PVC neurons to understand which other neurons may be highly connected to the locomotion circuit, based on the edge weights in the network.

To do so, here’s a personalisation list of all the AVA and PVC neurons:

PYTHON

locomotion_neurons = ["AVAL","AVAR","PVCL","PVCR"]
locomotion_dict_L1 = {}
for i in L1_connectome.nodes():
    if i in locomotion_neurons:
        locomotion_dict_L1[i] = 1
    else:
        locomotion_dict_L1[i] = 0

Let’s then run the PPR algorithm on the L1 connectome:

PYTHON

L1_locomotion_ppr = nx.pagerank(L1_connectome,
                    personalization = locomotion_dict_L1)
L1_locomotion_ppr_df = pd.DataFrame.from_dict(L1_locomotion_ppr, orient = "index")

## Rearrange dataframe result 
L1_locomotion_ppr_df.columns = ["L1_PPR"]
L1_locomotion_ppr_df = L1_locomotion_ppr_df.sort_values(by='L1_PPR', ascending=False)

print(L1_locomotion_ppr_df.head(n = 10))

OUTPUT

        L1_PPR
AVAL  0.047018
PVCR  0.045915
PVCL  0.044773
AVAR  0.044099
AVJR  0.013239
RIH   0.013121
RIML  0.012968
AVBL  0.012522
DVA   0.012106
AVBR  0.011585

And repeat this with the YA connectome:

PYTHON

locomotion_dict_YA = {}
for i in YA_connectome.nodes():
    if i in locomotion_neurons:
        locomotion_dict_YA[i] = 1
    else:
        locomotion_dict_YA[i] = 0
YA_locomotion_ppr = nx.pagerank(YA_connectome,
                    personalization = locomotion_dict_YA)
YA_locomotion_ppr_df = pd.DataFrame.from_dict(YA_locomotion_ppr, orient = "index")

## Sort by PPR result
YA_locomotion_ppr_df.columns = ["YA_PPR"]
YA_locomotion_ppr_df = YA_locomotion_ppr_df.sort_values(by='YA_PPR', ascending=False)

print(YA_locomotion_ppr_df.head(n = 10))

OUTPUT

        YA_PPR
AVAR  0.048689
AVAL  0.047712
PVCL  0.046524
PVCR  0.044718
AVBR  0.020519
AVBL  0.019474
AQR   0.014952
DVA   0.014784
RIML  0.012865
RIMR  0.012341

As shown in the sorted DataFrame, the key top neurons have changed drastically between L1 and YA C. elegans. Again, we can visualise the results in the L1 network.

Let’s begin by defining a function to plot the network nodes size by the result and highlighting node labels of the top n result:

PYTHON

# Function to draw network 
def draw_size(G, pos, measures, measure_name, top_n = 10):
    
    cent = np.fromiter(measures.values(), float)
    sizes = cent / np.max(cent) * 200
    nodes = nx.draw_networkx_nodes(G, pos, node_size=sizes, cmap=plt.cm.plasma, 
                                   node_color=list(measures.values()),
                                   nodelist=measures.keys())
    nodes.set_norm(mcolors.SymLogNorm(linthresh=0.01, linscale=1, base=10))
    edges = nx.draw_networkx_edges(G, pos, edge_color="#80808080")

    ## Add top 10 nodes labels
    measures_df = pd.DataFrame.from_dict(measures, orient = "index")
    measures_df = measures_df.sort_values(by=0, ascending=False)
    labels = measures_df.head(n = top_n).index.tolist()
    labels_dict = {l:l for l in labels}
    nx.draw_networkx_labels(G, 
                        pos, 
                        font_size=12,
                        labels = labels_dict)

    plt.title(measure_name)
    plt.colorbar(nodes)
    plt.axis('off')
    plt.show()

PYTHON

plt.figure(figsize=(12, 8))
pos = nx.kamada_kawai_layout(L1_connectome)
draw_size(L1_connectome, 
        pos, 
        L1_locomotion_ppr, 
        'L1 Network AVA/PVC PPR',
        top_n = 10)

Practise Exercise 7

  1. Generate a plot of the YA network with node size based on the PPR result and labelling the top 20 nodes.

Answer:

PYTHON

# 1. Generate a plot with PPR results
plt.figure(figsize=(12, 8))
pos = nx.kamada_kawai_layout(YA_connectome)
draw_size(YA_connectome, 
        pos, 
        YA_locomotion_ppr, 
        'YA Network AVA/PVC PPR',
        top_n = 20)

Shortest path analysis is another a powerful tool in studying the AVA/PVC circuit within the C. elegans connectome. This analysis helps to identify the most efficient routes for signal transmission between neurons, which is critical to understanding how information flows through the nervous system. By determining the shortest paths, researchers can pinpoint critical nodes and connections that play significant roles in neural communication and behavior. In the context of the AVA/PVC circuit, this method can highlight potential vulnerabilities in the network, where disruptions might lead to functional impairments.

Let’s try and quantify this using the shortest path analysis algorithm, with specific application to AVA-Right and PVC-Left:

PYTHON

## Extract all the shortest paths between AVAR and PVCL in L1
AVAR_PVCL_L1 = nx.all_shortest_paths(L1_connectome, 
                 source="AVAR", 
                 target="PVCL",
                 weight = None)

AVAR_PVCL_L1 = list(AVAR_PVCL_L1)

print(AVAR_PVCL_L1)

OUTPUT

[['AVAR', 'RIH', 'PVCL'], ['AVAR', 'SIAVR', 'PVCL'], ['AVAR', 'SMBDL', 'PVCL'], ['AVAR', 'RIMR', 'PVCL'], ['AVAR', 'AVBR', 'PVCL'], ['AVAR', 'ASHR', 'PVCL'], ['AVAR', 'AVER', 'PVCL'], ['AVAR', 'SIBVR', 'PVCL'], ['AVAR', 'AVBL', 'PVCL'], ['AVAR', 'AVDR', 'PVCL'], ['AVAR', 'AVJL', 'PVCL'], ['AVAR', 'AVHL', 'PVCL'], ['AVAR', 'FLPR', 'PVCL'], ['AVAR', 'ADER', 'PVCL'], ['AVAR', 'FLPL', 'PVCL'], ['AVAR', 'RIGL', 'PVCL'], ['AVAR', 'PVQR', 'PVCL'], ['AVAR', 'PVPR', 'PVCL'], ['AVAR', 'DVA', 'PVCL']]

Note: we have specified that keyword argument weight is set to a value of None, in this case. This is because, although shortest path analysis can take into account edge weights, larger edge weights would be considered as longer distance. And distance is calculated as the sum of the edge weights. In our network, the edge weights represent the number of neural connections between nodes; thus, stronger connections are indicated by larger edge weights. In this scenario, the shortest path algorithm, therefore, would be evaluating the edge weight in contradiction with biological context.

To overcome this, we could summarise the edge weights using the function nx.path_weight:

PYTHON

## Path weight for L1 network
path_weight = {}
for path in AVAR_PVCL_L1:
    pw = nx.path_weight(L1_connectome,
                tuple(path),
                "weight")
    path_weight[tuple(path)] = pw

We can then calculate the average weight of the shortest paths in L1 between AVAR and PCVL:

PYTHON

## First convert the results into a dataframe
L1_pathweight = pd.DataFrame.from_dict(path_weight, orient = "index")
L1_pathweight.columns = ["weight"]

## Then calculate the mean value
L1_mean_pathweight = L1_pathweight["weight"].mean()
print(L1_mean_pathweight)

OUTPUT

213650.52631578947

Similarly, let’s repeat the analysis on the YA network:

PYTHON

## Extract the shortest paths between AVAR and PVCL in YA
AVAR_PVCL_YA = nx.all_shortest_paths(YA_connectome, 
                 source="PVCL", 
                 target="AVAR",
                  weight = None)
AVAR_PVCL_YA = list(AVAR_PVCL_YA)

## Calculate path weight for YA network
path_weight = {}
for path in AVAR_PVCL_YA:
    pw = nx.path_weight(YA_connectome,
                tuple(path),
                "weight")
    path_weight[tuple(path)] = pw

## Calculate the average path weight
YA_pathweight = pd.DataFrame.from_dict(path_weight, orient = "index")
YA_pathweight.columns = ["weight"]
YA_mean_pathweight = YA_pathweight["weight"].mean()
print(YA_mean_pathweight)

OUTPUT

693120.0

Unsurprisingly, the shortest path between AVAR and PVCL has reduced in YA compared to L1, suggesting more efficient connectivty between the neurons responsible for forward and backward locomotion.

Analysing the average path weight reveals that the YA AVAR-PVCL axis has more connections compared to the L1 AVAR-PVCL axis. This supports the hypothesis that the locomotion link between AVAR and PVCL is stronger in young adults than in larval L1 stages: both in terms of the number of connections and the number of edges.

Practise Exercise 8

  1. Repeat the above analysis, but this time between PVCL and AVAR in the L1 network. What is the average shortest path weight?

  2. And now for PVCL and AVAR in the YA network. What is the average shortest path weight?

Answer:

PYTHON

## 1. Extract the shortest paths between PVCL and AVAR in L1
PVCL_AVAR_L1 = nx.all_shortest_paths(L1_connectome, 
                 source="PVCL", 
                 target="AVAR",
                  weight = None)
PVCL_AVAR_L1 = list(PVCL_AVAR_L1)

## Calculate path weight for L1 network
path_weight = {}
for path in PVCL_AVAR_L1:
    pw = nx.path_weight(L1_connectome,
                tuple(path),
                "weight")
    path_weight[tuple(path)] = pw

## Calculate the average path weight
L1_PA_pathweight = pd.DataFrame.from_dict(path_weight, orient = "index")
L1_PA_pathweight.columns = ["weight"]
L1_PA_pathweight = L1_PA_pathweight["weight"].mean()
print(L1_PA_pathweight)

OUTPUT

213650.52631578947

PYTHON

## 2. Extract the shortest paths between PVCL and AVAR in YA
PVCL_AVAR_YA= nx.all_shortest_paths(YA_connectome, 
                 source="PVCL", 
                 target="AVAR",
                  weight = None)
PVCL_AVAR_YA = list(PVCL_AVAR_YA)

## Calculate path weight for YA network
path_weight = {}
for path in PVCL_AVAR_YA:
    pw = nx.path_weight(YA_connectome,
                tuple(path),
                "weight")
    path_weight[tuple(path)] = pw

## Calculate the average path weight
YA_PA_pathweight = pd.DataFrame.from_dict(path_weight, orient = "index")
YA_PA_pathweight.columns = ["weight"]
YA_PA_pathweight = YA_PA_pathweight["weight"].mean()
print(YA_PA_pathweight)

OUTPUT

693120.0

Modulating the AVA/PVC axis:

Based on the shortest path analysis between AVAR and PVCL in the L1 network, we can see that some of the nodes connecting AVAR and PVCL also rank highly in the AVA/PVC PPR analysis. Thus, these nodes may have an important role to the AVA/PVC axis in the L1 network.

Let’s explore the impact of removing these nodes to the shortest path connection, and the PPR results of the L1 network.

Firstly, let’s identify the top node in the AVAR-PVCL shortest path overlapping as the top node in the PPR result:

PYTHON

# Flatten the nested list in the shortest path analysis:
L1_SP_nodes = [item for sublist in AVAR_PVCL_L1 for item in sublist]
L1_SP_nodes = list(set(L1_SP_nodes))
L1_SP_nodes

OUTPUT

['AVHL', 'AVBR', 'RIGL', 'PVQR', 'ADER', 'FLPL', 'AVAR', 'ASHR', 'PVCL', 'RIMR', 'SMBDL', 'PVPR', 'SIBVR', 'RIH', 'AVBL', 'SIAVR', 'AVJL', 'AVER', 'DVA', 'FLPR', 'AVDR']

PYTHON

# From the list of nodes, remove AVAR and PVCL, and extract their PPR scores:
L1_SP_PPR = {}
for node in L1_SP_nodes:
    if node in ["AVAR","PVCL"]:
        continue # Skip the function
    else: 
        L1_SP_PPR[node] = L1_locomotion_ppr[node]

# Convert results to dataframe 
L1_SP_PPR_df = pd.DataFrame.from_dict(L1_SP_PPR,
                                   orient = "index")

# Sort dataframe by PageRank result
L1_SP_PPR_df = L1_SP_PPR_df.sort_values(by=0, ascending=False)
L1_SP_PPR_df

OUTPUT

              0
RIH    0.013121
AVBL   0.012522
DVA    0.012106
AVBR   0.011585
RIMR   0.011334
AVJL   0.011015
AVDR   0.007838
RIGL   0.007752
SIBVR  0.006871
SMBDL  0.006143
ASHR   0.006121
AVER   0.006072
FLPL   0.005934
PVPR   0.005701
SIAVR  0.005324
AVHL   0.004963
FLPR   0.004755
ADER   0.004169
PVQR   0.003647

Thus, the top node is RIH. Let’s remove it from the L1 network and see how this impacts the L1 network in AVA/PVC PPR:

PYTHON

# Make a copy of the graph
L1_connectome_new = L1_connectome.copy()

# Remove a node from the copied graph
L1_connectome_new.remove_node("RIH")

# Print the remaining nodes and edges
print("Nodes:", L1_connectome_new.number_of_nodes())

OUTPUT

Nodes: 199

PYTHON

print("Edges:", L1_connectome_new.number_of_edges())

OUTPUT

Edges: 9473

Let’s calculate the AVA/PVC PPR, again:

PYTHON

locomotion_dict_new_L1 = {}
for i in L1_connectome_new.nodes():
    if i in locomotion_neurons:
        locomotion_dict_new_L1[i] = 1
    else:
        locomotion_dict_new_L1[i] = 0
L1_locomotion_ppr_new = nx.pagerank(L1_connectome_new,
                    personalization = locomotion_dict_new_L1)
L1_locomotion_ppr_new_df = pd.DataFrame.from_dict(L1_locomotion_ppr_new, orient = "index")

## Sort by PPR result
L1_locomotion_ppr_new_df.columns = ["New_PPR"]
L1_locomotion_ppr_new_df = L1_locomotion_ppr_new_df.sort_values(by='New_PPR', ascending=False)

print(L1_locomotion_ppr_new_df.head(n = 10))

OUTPUT

       New_PPR
AVAL  0.047401
PVCR  0.046554
PVCL  0.045082
AVAR  0.044453
AVJR  0.013856
RIML  0.013240
AVBL  0.013022
DVA   0.012541
AVBR  0.012102
AVJL  0.011725

Let’s explore the top changes, as before:

PYTHON

PPR_df = pd.merge(L1_locomotion_ppr_df, 
                  L1_locomotion_ppr_new_df, 
                  left_index=True, right_index=True, ## joining by index
                  how='inner') ## Makes sure only taking in common nodes
PPR_df.columns = ["PPR","New_PPR"]
PPR_df["Differences"] = PPR_df["New_PPR"] - PPR_df["PPR"]

PPR_df = PPR_df.reindex(PPR_df['Differences'].abs().sort_values(ascending=False).index)

top_changes = PPR_df.head(10).index.tolist()
print("Top 10 nodes with most signficant changes:", top_changes)

OUTPUT

Top 10 nodes with most signficant changes: ['AVJL', 'PVCR', 'AVJR', 'OLQVR', 'OLQVL', 'AVBR', 'AVBL', 'RIPR', 'IL2DR', 'RIFR']

And then quantify the shortest path weights, again:

PYTHON

AVAR_PVCL_L1_new = nx.all_shortest_paths(L1_connectome_new, 
                 source="AVAR", 
                 target="PVCL",
                  weight = None)
AVAR_PVCL_L1_new = list(AVAR_PVCL_L1_new)

## Calculate path weight for L1 network
path_weight = {}
for path in AVAR_PVCL_L1_new:
    pw = nx.path_weight(L1_connectome_new,
                tuple(path),
                "weight")
    path_weight[tuple(path)] = pw

## Calculate the average path weight
L1_PA_pathweight_new = pd.DataFrame.from_dict(path_weight, orient = "index")
L1_PA_pathweight_new.columns = ["weight"]
L1_PA_pathweight_new = L1_PA_pathweight_new["weight"].mean()
print(L1_PA_pathweight_new)

OUTPUT

202106.66666666666

Assignment Questions (Nematode Example):


End of chapter Exercises (Nematode Example):

NOTE: You have one of two options for this assignment. If you choose the neural signal propagation in C. elegans nematode worms example, please complete and submit this assignment only (and not the cancer network example).

  1. Read in the network for L3 of the C. elegans connectome and calculate the number of nodes and edges in the network.

  2. Calculate the Personalised PageRank of L3 from the AVA/PVC axis. Which are the top 10 nodes by PPR score?

  3. Calculate all shortest paths and their associated edge weights between AVAL and PVCR. What is the average edge weight of the shortest path?

  4. Calculate all shortest paths and their associated edge weights between PVCR and AVAL. What is the average edge weight of the shortest path?

  5. Based on the shortest path results and the PPR results, which node being removed would have the greatest impact on the AVA/PVC axis in this network? Please give your answer by generating a new network and calculating a new set of PPR and shortest path results.

Key Points

  • Construction of meaningful biological networks
  • Using network analyses to answer biological questions
  • Interpreting network analyses results in biological contexts
  • Learning to select appropriate network analyses based on the research question
  • Two case examples in the form of cancer networks and the C. elegans connectome